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ABSTRACT 


Waveform  relaxation  algorithms  for  the  simulation  of  MOS  circuits  exhibit 
natural  parallelism,  arising  from  the  intrinsic  partitioning  of  the  circuit  into  subcircuits 
which  are  solved  separately  during  the  iterative  solution  process.  Investigated  in  this 
thesis  is  the  extent  to  which  the  overall  run  time  of  a  simulation  can  be  reduced  by 
utilizing  the  natural  parallelism  of  waveform  relaxation  on  parallel  processors.  Four 
parallel  waveform  relaxation  algorithms  are  considered,  based  on  the  Gauss-Seidel  and 
Gauss-Jacobi  relaxation  methods,  in  which  parallelism  is  exploited  at  the  individual 
time  point  level  or  at  the  time  window  level.  The  algorithm  with  the  fastest  run  time 
depends  on  the  characteristics  of  the  circuit  being  simulated  and  on  the  number  of  pro¬ 
cessors  used.  The  Gauss-Jacobi  method  with  time  point  pipelining  is  introduced  as  a 
highly  parallel  algorithm  which  can  outperform  the  other  algorithms  when  the  number 
of  processors  is  large. 

A  theorem  is  presented  comparing  the  Gauss-Seidel  and  Gauss-Jacobi  methods,  as 
applied  to  the  solution  of  a  set  of  linear  algebraic  equations  of  the  type  which  occur  at 
each  time  point  in  the  simulation  of  MOS  circuits.  Gauss-Jacobi  is  shown  to  be  asymp¬ 
totically  faster  than  Gauss-Seidel  when  the  number  of  processors  is  sufficiently  large. 

Simplified  speedup  estimates  are  used  in  a  presimulation  selection  procedure  which 
selects  the  fastest  of  the  parallel  waveform  relaxation  algorithms  prior  to  performing 
the  simulation  of  a  given  circuit  on  a  given  number  of  processors.  More  accurate  esti¬ 
mates  of  the  potential  parallel  processing  speedups.  neglecting  overhead,  are  produced 
by  the  PARASITE  parallel  simulation  time  estimator,  which  uses  CPU  lime  measure¬ 
ments  from  a  uniprocessor  simulation  to  estimate  the  parallel  run  time  on  any  number 
of  processors.  PARASITE  estimates  indicate  that  speedups  of  about  one  order  of  magni- 
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tude  are  possible  for  1000-transistor  circuits  on  32  processors,  where  the  speedup  is 
measured  with  respect  to  Gauss-Seidel  waveform  relaxation  on  a  single  processor. 

The  parallel  waveform  relaxation  algorithms  have  been  implemented  in  programs 
which  run  on  an  8-processor  Alliant  FX/8  multiprocessor.  Speedups  within  11%  to 
21%  of  the  PARASITE  estimates  are  achieved. 
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CHAPTER  1 


INTRODUCTION 


Circuit  simulation  plays  a  key  role  in  verifying  that  very  large  scale  integrated 
(VLSI)  circuit  designs  are  correct  and  meet  performance  specifications  prior  to  fabricat¬ 
ing  the  circuits  [\ag75.  Wee73].  The  amount  of  computer  time  required  to  simulate  a 
significant  portion  of  a  large  circuit  can  be  prohibitive  when  detailed  models  of  the  cir¬ 
cuit  elements  are  used,  and  the  differential  equations  are  integrated  to  produce  accurate 
voltage  waveforms  as  a  function  of  time  [Whi86].  Simulators  that  use  models  based  on 
higher  levels  of  abstraction,  such  as  switch  and  logic  level  simulators  [Bry84.  Rao85. 
Szy72.  Haj83].  can  achieve  run  times  which  are  orders  of  magnitude  faster,  but  the 
results  are  not  as  precise.  As  integrated  circuit  technology  advances,  the  number  of  cir¬ 
cuit  elements  increases  at  the  same  time  that  feature  sizes  are  reduced  and  parasitic 
effects  become  more  important  in  determining  circuit  performance.  Consequently,  as 
the  need  for  faster  simulation  becomes  more  important,  the  ability  to  satisfy  this  need 
through  less  precise  simulation  techniques  diminishes  [Kan85], 

The  need  for  faster,  precise  circuit  simulation  that  is  motivated  by  advances  in 
integrated  circuit  technology  can  be  addressed  by  employing  the  fruits  of  these  techno¬ 
logical  advances  in  the  computers  which  are  used  to  simulate  circuits.  Multiprocessor 
computers,  comprised  of  several  processors  which  can  work  together  to  solve  a  single 
problem,  are  able  to  provide  a  large  amount  of  total  computing  power  at  a  reasonable 
cost  [Kuc86].  To  successfully  use  the  parallelism  in  the  computer  hardware  to  reduce 
the  overall  run  time  of  the  simulation  program,  the  set  of  computations  must  be  parti¬ 
tioned  into  subsets  which  can  be  executed  currently  on  different  processors,  while 
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preserving  the  integrity  of  the  final  solution. 

Early  attempts  to  exploit  parallelism  to  reduce  the  time  required  to  perform  cir¬ 
cuit  simulations  utilized  vector  processors,  which  achieve  performance  gains  from 
parallelism  when  all  the  elements  of  a  vector  are  processed  identically.  The  vector  ele¬ 
ments  are  pipelined  through  the  processor  hardware  such  that  different  elements  are 
processed  by  different  stages  of  the  hardware  simultaneously  [Hwa84].  The  hardware 
parallelism  is  limited  by  the  number  of  stages  in  the  pipeline.  The  effective  use  of  vec¬ 
tor  processing  in  standard  direct  method  circuit  simulation  algorithms  [Nag75.  Wee73. 
Yan80]  is  hampered  by  the  sparse,  irregular  interconnection  structures  of  circuits, 
which  result  in  highly  sparse  unstructured  matrices  that  are  not  efficiently  processed  by 
vector  operations.  When  circuit  elements  are  evaluated,  elements  described  by  identical 
equations  can  be  processed  in  the  vector  mode.  But  even  circuit  elements  that  use  ident¬ 
ical  models  operate  in  different  regions  at  different  points  in  time,  and  the  different 
regions  are  described  by  different  sets  of  equations,  thus  hindering  vectorization.  As  a 
result  of  these  problems,  significant  overhead  penalties  are  incurred  in  gathering  data 
into  vectors  which  can  utilize  the  fast  vector  processing  capabilities,  and  in  scattering 
the  results  of  vector  computations  back  to  their  ultimate  destinations  [Cal79.  Cal80. 
Vla82.  Ham83.  May83.  Yam85]. 

Parallel  processors  of  the  M1MD  (multiple  instruction,  multiple  data)  type  consist 
of  separate  processors  which  can  perform  independent  operations  on  independent  sets  of 
operands  [Fly 72].  The  processors  can  share  data  either  through  some  type  of  data  com¬ 
munication  mechanism  or  through  shared  memory.  Compared  to  a  vector  processor, 
this  type  of  architecture  offers  a  more  flexible  environment  for  exploiting  parallelism, 
because  the  concurrent  operations  need  not  be  identical  and  they  need  not  be  performed 
on  operands  which  are  organized  as  elements  of  a  vector.  The  use  of  parallel  processors 


for  direct  method  circuit  simulation  remains  an  active  area  of  research.  Good  perfor¬ 
mance  is  easy  to  obtain  in  evaluating  the  models  of  circuit  elements,  but  effective  paral¬ 
lelization  of  the  solution  of  the  sparse  unstructured  matrix  equations  is  more  difficult, 
due  to  the  large  number  of  fine-grained  data  dependencies  which  occur  in  Gaussian 
elimination  and  LU  factorization  [WinSO,  Bis86.  Cox87,  Jac87.  Nak87.  Sad87.  Sma87a]. 

Relaxation  methods  for  circuit  simulation  [New84]  involve  partitioning  the  circuit 
into  subcircuits  which  are  solved  independently,  while  treating  the  voltages  external  to 
each  subcircuit  as  if  they  were  independent  voltage  sources.  The  values  of  these  voltage 
sources  are  updated  with  the  solutions  of  neighboring  subcircuits  on  each  iteration  of 
the  iterative  solution  process.  These  algorithms  have  natural  parallel  implementations 
because  of  the  inherent  partitioning  of  the  circuit.  Relaxation  methods,  when  carried  to 
convergence,  produce  precise  voltage  waveforms  of  the  same  quality  as  standard  direct 
method  algorithms.  Different  forms  of  relaxation  algorithms  have  been  implemented  on 
parallel  processors  [Gal88].  including  nonlinear  algebraic  equation  relaxation  techniques 
[Deu84.  Web87],  waveform  relaxation  [Whi85b,  Uno85.  Mat86,  Dum87.  Sma87b. 
Sma88b],  and  waveform-Newton  [Sal87b]. 

Addressed  in  this  thesis  is  the  question  of  how  much  of  a  speed  improvement  in 
circuit  simulation  run  time  can  be  obtained  by  exploiting  the  natural  parallelism  of 
waveform  relaxation  [Lel82,  Whi86,  Rue87.  Hsi85]  on  parallel  processors.  Several 
different  parallel  waveform  relaxation  algorithms  are  described  in  Chapter  2.  based  on 
the  Gauss-Seidel  and  Gauss-Jacobi  (or  Jacobi)  relaxation  methods  [0rt70],  using  win¬ 
dow  level  parallelism  and  time  point  pipelining  (Whi85b].  The  Gauss-Jacobi  method  in 
combination  with  time  point  pipelining  is  shown  to  produce  an  algorithm  with  a  com¬ 
paratively  high  degree  of  parallelism,  which  is  effective  when  the  number  of  processors 
is  large  compared  to  the  size  of  the  circuit  [Sma88b]. 
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The  parallel  performances  of  the  Gauss-Seidel  and  Gauss- Jacobi  methods  are 
addressed  in  Chapter  3  from  a  theoretical  viewpoint.  A  theorem  is  presented  which 
shows  that  parallel  Gauss-Jacobi  is  asymptotically  faster  than  parallel  Gauss-Seidel 
when  the  number  of  processors  is  sufficiently  large,  under  certain  conditions  which 
apply  to  the  solution  of  linear  equations  arising  in  the  simulation  of  M05  circuits.  A 
formula  is  also  derived  for  the  average  ratio  of  parallelism  of  the  two  methods,  based 
on  the  nonzero  structure  of  the  matrix. 

In  Chapter  4.  speedup  estimates  are  computed  for  a  set  of  5  benchmark  circuits, 
for  the  competing  parallel  waveform  relaxation  algorithms.  The  speedups  indicate  how 
much  faster  the  algorithms  run  on  multiple  processors  compared  to  a  single  processor. 
Two  categories  of  speedup  estimates  are  considered.  Presimulation  estimates  are  based 
on  simplifying  assumptions  that  allow  the  estimates  to  be  computed  prior  to  perform¬ 
ing  a  simulation  of  the  circuit.  These  estimates  provide  first-order  insights  into  the 
sources  of  parallelism,  and  the  factors  which  inhibit  parallelism  in  real  circuit  exam¬ 
ples.  The  presimulation  estimates  also  provide  a  basis  for  selecting  one  of  the  algo¬ 
rithms  prior  to  simulating  a  particular  circuit  on  a  given  number  of  processors.  Post¬ 
simulation  estimates  are  more  accurate  than  presimulation  estimates  because  they  util¬ 
ize  detailed  information  obtained  in  the  simulation  of  the  circuit  on  a  uniprocessor. 
Post-simulation  estimates  are  used  to  generate  accurate  projections  of  the  potential  per¬ 
formance  of  the  algorithms  when  the  number  of  processors  is  increased  beyond  that 
which  is  currently  available.  Post-simulation  speedup  estimates  excluding  multipro¬ 
cessing  overhead  are  compared  in  subsequent  chapters  with  actual  multiprocessor  per¬ 
formance  results,  to  determine  the  extent  to  which  the  overhead  factors  impact  the  per¬ 
formance.  Speedup  estimates  excluding  overhead  have  been  used  previously  in  the 
study  of  parallel  iterated  timing  analysis  [Deu84], 
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The  parallel  waveform  relaxation  algorithms  have  been  implemented  in  two  pro¬ 
grams  which  run  on  an  Alliant  FX/S  multiprocessor,  using  up  to  8  processors.  These 
programs  are  described  in  Chapters  5  and  6.  Measured  performance  results  are  given 
for  the  parallel  implementations,  and  these  results  are  compared  with  the  speedup  esti¬ 
mates  of  Chapter  4. 

Latency  in  the  computations  of  waveform  relaxation  can  be  exploited  to  reduce 
the  number  of  computations  that  must  be  performed.  Reducing  the  number  of  compu¬ 
tations  reduces  the  run  time  on  a  uniprocessor,  and  may  or  may  not  reduce  the  run  time 
on  a  multiprocessor,  depending  on  the  data  dependencies  between  the  nonlatent  compur 
tations  and  on  the  overhead  involved  in  detecting  latency.  The  impact  of  latency 
exploitation  on  parallel  waveform  relaxation  is  addressed  in  Chapter  7.  The  primary 
form  of  latency  that  is  considered  is  iteration  latency,  which  has  also  been  called  partial 
waveform  convergence  in  previous  work  [Whi86].  The  impacts  of  latency  on  the  paral¬ 
lel  implementations  and  on  parallel  performance  are  discussed,  and  performance  results 
are  presented.  Finally,  conclusions  are  presented  in  Chapter  8. 


CHAPTER  2 


PARALLEL  WAVEFORM  RELAXATION  ALGORITHMS 


Waveform  relaxation  is  a  class  of  iterative  algorithms  for  solving  ordinary 
differential  equations.  When  applied  to  the  differential  equations  that  describe  the  vol¬ 
tages  of  a  circuit  as  a  function  of  time,  waveform  relaxation  is  guaranteed  to  converge, 
provided  that  the  circuit  equations  satisfy  certain  conditions.  These  conditions  are 
readily  satisfied  by  MOS  integrated  circuits  which  are  represented  by  node  equations, 
provided  that  the  inevitable  capacitance  from  each  node  to  ground  is  included  in  the 
equations  [New84].  As  in  other  relaxation  based  algorithms,  the  equations  are  parti¬ 
tioned  into  subsystems  which  are  solved  independently  during  the  iterative  process. 
Due  to  the  inherent  partitioning  of  the  equations,  waveform  relaxation  exhibits  natural 
parallelism  which  can  be  utilized  on  parallel  processors  to  reduce  the  overall  computa¬ 
tion  time. 

In  this  chapter,  the  basic  waveform  relaxation  algorithm  is  summarized.  The  form 
of  waveform  relaxation  implemented  in  the  RELAX2.3  program  [Whi85a.  Whi86]  is 
used  as  a  model  of  the  basic  algorithm  on  a  uniprocessor.  Different  approaches  for 
exploiting  parallelism  are  then  identified.  Both  the  Gauss-Jacobi  and  Gauss-Seidel 
relaxation  methods  are  considered,  and  the  techniques  of  exploiting  parallelism  at  the 
window  level  or  at  the  time  point  level  are  compared.  Task  graphs  representing  the 
compulations  and  their  interdependencies  are  introduced.  A  first-order  analysis  of  the 
parallel  algorithms  is  presented  based  on  the  task  graphs  and  some  simplifying  assump¬ 
tions.  The  combination  of  the  Gauss-Jacobi  relaxation  method  and  time  point  pipelining 
is  identified  as  the  algorithm  with  the  greatest  potential  parallelism  of  the  algorithms 
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considered.  The  performance  of  the  parallel  algorithms  is  investigated  in  greater  detail 
in  subsequent  chapters. 

2.1  Waveform  Relaxation 

Consider  a  circuit  with  N  nodes,  which  satisfies  a  system  of  differential  node 
equations  of  the  form 

q (v (r  f  (v (r ). u (r )) .  r€[0.r,]  (2.1) 

with  initial  conditions 

v  (0)  *  V  ,  (2.2) 

where  t  €R  is  time,  v:R-*R  is  the  vector  of  node  voltages.  V  €R  is  the  vector  of 

initial  node  voltages,  u  :R-*Rr  is  the  vector  of  known  source  values.  /  iR*  xRr  -*R'V 
is  the  vector  of  currents  flowing  into  charge  storage  elements  at  each  node. 
q  :R  xR  -*R  is  the  vector  of  charges  stored  at  the  nodes,  and  q  is  the  time  deriva¬ 
tive  of  q .  Prior  to  solving  the  equations  by  waveform  relaxation,  the  system  of  equa¬ 
tions  must  be  partitioned  into  subsystems.  Since  (2.1)  is  written  in  terms  of  node  equa¬ 
tions.  the  equation  partitioning  problem  is  equivalent  to  partitioning  the  set  of  circuit 
nodes  into  subsets  that  are  mutually  exclusive  and  exhaustive.  Each  subset  of  nodes, 
together  with  the  circuit  elements  connected  to  the  nodes,  represents  a  subcircuit.  To 
achieve  fast  convergence  speed  during  the  relaxation  iterations,  it  is  important  to  keep 
nodes  that  are  tightly  coupled  to  each  other  in  the  same  subcircuit. 

For  the  Gauss-Seidel  relaxation  method,  an  ordering  of  the  subcircuits  must  be 
defined.  For  fast  convergence,  this  ordering  should  reflect  the  predominant  direction  of 
signal  flow  in  the  circuit.  If  the  circuit  contains  n  subcircuits,  then  the  subcircuits  are 
assigned  numbers  from  1  to  n  such  that,  in  as  many  cases  as  possible,  strong  signals 
flow  from  lower  numbered  subcircuits  to  higher  numbered  subcircuits.  When  feedback 
paths  are  present,  some  signals  will  flow  from  higher  numbered  subcircuits  to  lower 


numbered  subcircuits,  and  these  signal  paths  can  lead  to  slow  convergence. 

To  counteract  the  negative  impact  of  global  and  local  feedback  on  convergence 
speed,  the  time  interval  [0.  tf  ]  is  partitioned  into  subintervals  called  windows.  Conver¬ 
gence  occurs  more  rapidly  at  the  beginning  than  at  the  end  of  a  window.  Consequently, 
if  the  size  of  a  window  is  reduced,  in  order  to  include  only  the  first  part  of  the  original 
window,  then  fewer  iterations  are  required  for  convergence.  However,  if  the  window 
sizes  are  made  too  small,  then  excessive  time  points  will  be  introduced  at  the  window 
boundaries  for  those  subcircuits  that  have  constant  or  slowly  changing  signals.  The 
window  boundaries  are  modified  dynamically  during  the  solution  process  based  on  the 
observed  convergence  speed  and  signal  activity. 


During  the  actual  equation  solution  phase  of  a  waveform  relaxation  program,  the 
windows  are  processed  sequentially,  with  the  final  solution  point  of  one  window  serv¬ 
ing  as  the  initial  condition  of  the  next  window.  Within  each  window,  the  subcircuits 
are  solved  independently  on  each  iteration,  using  previously  computed  or  guessed 
values  for  the  voltages  which  are  external  to  the  subcircuit  being  solved.  The 
differential  equation  of  subcircuit  i  in  window  [fa.  tb  ]  on  iteration  k  is  given  by 


.,«(*)  „  <* )  (t )  - 

9i<vi.i  •  •  •  •  >  vi-u-  vi  •  v< 
ISV  l.i  -  •  '  *  *  Vi-W  v,  *  *.♦!.<• 


v„  ,  ,  u ),  t  €[ra,r6]. 


where  vi  is  the  vector  of  node  voltages  for  those  nodes  that  are  in  subcircuit  i ,  and 
and  /,  are  vector  functions  containing  the  components  of  q  and  /  that  correspond  to 
the  nodes  of  subcircuit  i.  The  vector  v >(*\  for  any  j^i.  is  a  vector  of  voltage 


waveforms  for  the  nodes  belonging  to  subcircuit  / ,  which  are  treated  as  known  source 
waveforms  in  the  solution  of  subcircuit  i .  The  iteration  from  which  these  waveforms 
are  obtained  depends  on  which  relaxation  method  is  used.  When  Gauss-Seidel  relaxa- 
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.  if  j  <i 


vj.i  \  if  j  >i 


and  when  Gauss- Jacobi  relaxation  is  used. 


« (* )  (*-i)  ,  „  .  . . 

Vj  j  -Vj  .  for  all  j  *». 


In  all  cases  v(0)  is  a  vector  of  initial  guess  waveforms. 

The  basic  waveform  relaxation  algorithm  with  windowing,  using  either  Gauss- 
Seidel  or  Gauss-Jacobi  relaxation  can  be  summarized  as  follows: 


Algorithm  2.1.  Windowed  Waveform  Relaxation 

partition  into  subcircuits 
order  subcircuits 
t'-O 

while  (ta  <tf  )  {  /*  window  loop  */ 

choose  tb 
k- 1 

repeat  {  /*  relaxation  iteration  loop  */ 

if  (any  subcircuit  used  too  many  time  points)  decrease  tb 
if(*€A){ 

decrease  tb 

decrease  integration  error  tolerance 


for  (*-1.2 . n  )  {  /*  subcircuit  loop  */ 

solve  (2.3)  for  v/* )  /*  subcircuit  evaluation  task  */ 

} 

*«-*+l 

}  until  (convergence  obtained) 
reinitialize  integration  error  tolerance 


The  endpoint  of  the  current  window.  tb .  is  initially  determined  based  on  the 
number  of  points  and  the  number  of  iterations  of  the  previous  window.  Target  values 
of  approximately  60  time  points  and  3  iterations  are  used  in  RELAX2.3  to  control  the 
window  sizes.  If  these  targets  are  exceeded,  the  window  size  is  decreased.  If  the 
number  of  time  points  and  iterations  in  a  window  are  significantly  below  the  target 
values,  then  the  size  of  the  following  window  is  increased.  Most  windows  converge 


without  triggering  the  window  reduction  mechanism. 

The  set  K  contains  the  iteration  numbers  on  which  the  window  size  is  to  be 
reduced  to  encourage  faster  convergence.  RELAX2.3  uses  K  ={6.  12.  IS.  -  ■  -  }.  The 
reduction  of  the  integration  error  tolerance  which  accompanies  a  window  reduction 
causes  smaller  time  steps  to  be  used  in  the  numerical  integration,  since  the  waveform 
relaxation  convergence  theorem  assures  convergence  only  when  the  step  size  is 
sufficiently  small  [Whi86].  Convergence  is  detected  on  iteration  k  if  v(k)  matches 
v(i  _l).  within  a  specified  convergence  tolerance,  at  each  point  in  the  window  [ra.  tb  ]. 

Algorithm  2.1  does  not  include  the  partial  waveform  convergence  feature  in  which 
subcircuit  evaluations  are  bypassed  if  the  input  waveforms  of  the  subcircuit  match  the 
previous  iteration.  The  use  of  partial  waveform  convergence  in  parallel  waveform 
relaxation  is  addressed  in  Chapter  7. 

Nearly  all  of  the  computational  effort  is  concentrated  in  solving  (2.3)  in  the  inner¬ 
most  loop  of  the  algorithm.  This  is  a  problem  of  exactly  the  same  form  as  the  original 
problem  represented  by  (2.1):  however,  the  size  of  the  problem  is  smaller  both  in  terms 
of  the  number  of  unknowns  and  the  length  of  the  time  interval.  Conventional  circuit 
simulation  techniques  [Chu75.  Nag75]  are  used  to  solve  the  subcircuit  equations:  the 
time  scale  is  discretized  by  an  implicit,  stiffly  stable,  variable  step  size  numerical 
integration  algorithm  (Gea7l]:  the  nonlinear  algebraic  equations  which  result  at  each 
time  point  are  solved  iteratively  using  Newton's  method;  and  the  linear  equations  aris¬ 
ing  on  each  Newton  iteration  are  solved  using  Gaussian  elimination. 

2.2  Parallel  Algorithms 

The  objective  of  this  section  is  to  identify  the  parallelism  in  Algorithm  2.1.  which 
can  be  exploited  on  parallel  processors.  Attention  is  restricted  to  the  portion  of  the 
algorithm  that  actually  solves  the  differential  equations,  excluding  the  partitioning  and 
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ordering  steps.  Although  parallelism  can  also  be  exploited  in  partitioning  and  ordering, 
these  tasks  account  for  a  small  fraction  of  the  overall  computation  time,  and  therefore, 
the  parallelization  of  these  tasks  offers  only  limited  opportunity  for  speeding  up  the 
overall  run  time. 

At  the  highest  level  of  Algorithm  2.1  is  the  window  loop.  In  each  window,  the 
initial  conditions  are  obtained  from  the  final  voltage  values  of  the  previous  window. 
These  values  are  known  only  when  the  previous  window  converges.  Consequently,  the 
windows  must  be  processed  serially.  Within  a  window,  the  main  computational  tasks 
are  subcircuit  evaluation  tasks.  Each  of  these  tasks  consists  of  the  computations 
required  to  solve  a  subcircuit  over  an  entire  time  window  on  one  relaxation  iteration. 
Some  of  the  subcircuit  evaluation  tasks  can  be  performed  concurrently,  but  restrictions 
are  imposed  on  the  sequence  of  task  executions  due  to  the  propagation  of  waveforms 
from  one  task  to  another. 

A  conservative  method  of  managing  the  restrictions  on  parallel  task  executions  is 
the  full  window  technique.  In  this  scheme,  a  subcircuit  evaluation  task  is  allowed  to 
begin  executing  only  after  all  of  its  input  waveforms  from  other  tasks  are  available 
over  the  entire  current  window.  For  example,  if  subcircuit  7  on  iteration  3  requires 
input  waveforms  from  subcircuit  1  on  iteration  3  and  subcircuit  8  on  iteration  2,  then 
the  evaluation  of  subcircuit  7  would  not  begin  on  iteration  3  until  after  the  solutions  of 
subcircuits  1  and  8  were  completed  over  the  entire  window  for  iterations  3  and  2. 
respectively.  Scheduling  of  subcircuit  evaluation  tasks  in  the  full  window  technique  is 
relatively  inexpensive.  When  a  task  finishes,  it  can  check  to  see  if  any  of  the  tasks  to 
which  it  supplies  waveforms  are  ready  to  start  executing.  These  checks  need  only  be 
performed  after  a  task  computes  the  last  time  point  of  a  window. 
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The  full  window  technique  utilizes  a  course  grain  of  parallelism  in  that  it  treats  a 
subcircuit  evaluation  task  as  an  indivisible  schedulable  entity,  and  subcircuit  evalua¬ 
tion  tasks  normally  involve  a  large  number  of  computations.  The  advantage  of  this 
approach  is  that  the  scheduling  overhead  is  small  compared  to  the  amount  of  computa¬ 
tions  performed  in  the  subcircuit  evaluations.  The  disadvantage  is  that  the  full  degree 
of  available  parallelism  is  not  exploited.  The  time  point  pipelining  parallelization  stra¬ 
tegy  exposes  greater  parallelism,  at  the  expense  of  increased  overhead,  by  vising  a  finer 
granularity. 

In  time  point  pipelining,  the  subcircuit  evaluation  tasks  are  broken  down  into  sub¬ 
tasks.  each  consisting  of  the  evaluation  of  a  subcircuit  on  a  single  iteration  at  a  single 
time  point.  Each  of  these  subtasks  is  allowed  to  begin  executing  as  soon  as  all  of  its 
input  data  are  available.  In  the  example  cited  above,  subcircuit  7  may  compute  a  time 
point  at  time  t  after  the  evaluations  of  subcircuits  1  and  8  have  progressed  through 
time  t  on  iterations  2  and  3.  respectively.  Since  computations  can  begin  sooner  in  time 
point  pipelining  than  in  the  f  ull  window  technique,  the  overall  completion  time  should 
be  smaller  for  time  point  pipelining,  and  the  degree  of  parallelism  should  be  larger. 

The  scheduling  of  computations  in  time  point  pipelining  is  more  expensive  than  in 
the  full  window  technique,  because  checks  must  be  performed  after  the  computation  of 
each  time  point  to  determine  if  any  other  subtask  is  eligible  to  execute  as  a  result  of  the 
availability  of  the  newly  computed  time  point  data.  Consequently,  the  full  window 
technique  and  time  point  pipelining  offer  a  tradeoff  between  lower  overhead  and  greater 
parallelism. 

Further  parallelism  can  be  exploited  within  each  individual  time  point  subtask, 
with  an  accompanying  increase  in  overhead.  Since  standard  direct  method  circuit  simu¬ 
lation  techniques  are  used  at  the  subcircuit  level,  the  problem  of  parallelizing  compuia- 
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tions  within  a  single  subcircuit  evaluation  task  is  equivalent  to  the  problem  of  parallel¬ 
izing  the  standard  direct  methods.  The  evaluation  of  the  model  equations  for  each  cir¬ 
cuit  element  can  be  performed  in  parallel  on  each  Newton  iteration,  and  the  loading  of 
the  Jacobian  matrix  is  readily  parallelized.  The  parallelization  of  the  solution  of  the 
linear  equations  is  hampered  by  the  high  degree  of  sparsity  in  the  matrix,  the  irregular 
pattern  of  nonzero  matrix  entries,  and  the  large  number  of  data  dependencies  in  Gaus¬ 
sian  elimination.  In  the  context  of  waveform  relaxation,  the  amount  of  parallelism 
available  within  a  single  subcircuit  evaluation  task  will  be  small  compared  to  the  paral¬ 
lelism  of  direct  methods  applied  to  the  entire  circuit,  because  the  subcircuit  sizes  are 
typically  small.  For  those  circuits  which  cannot  be  successfully  partitioned  into  subcir¬ 
cuits  of  uniformly  small  size,  the  use  of  parallelism  within  the  subcircuit  evaluation 
tasks  offers  a  potential  for  accelerating  the  solution  of  the  larger  subcircuits,  which  tend 
to  create  bottlenecks  in  the  full  window  technique  and  time  point  pipelining.  The  use 
of  parallelism  within  individual  subcircuit  evaluation  tasks  is  not  considered  further 
here.  Instead,  attention  is  focused  on  the  natural  parallelism  between  different  subcir¬ 
cuits  that  arises  directly  from  the  use  of  waveform  relaxation. 

The  full  window  technique  and  time  point  pipelining  are  two  different  methods  of 
orchestrating  the  parallel  execution  of  a  fixed  set  of  compulations.  Contrastingly,  the 
Gauss-Seidel  and  Gauss-Jacobi  relaxation  methods  result  in  different  computations 
being  performed  to  reach  approximate  solutions  that  match  the  exact  solution  within 
some  acceptable  tolerance.  The  Gauss-Seidel  method  generates  a  set  of  computations 
which  generally  converges  to  the  solution  in  fewer  iterations  than  Gauss-Jacobi.  How¬ 
ever.  the  Gauss-Jacobi  method  generates  a  set  of  computations  which  has  a  higher  degree 
of  parallelism,  because  all  subcircuit  evaluation  tasks  of  any  given  iteration  can  be  exe¬ 
cuted  concurrently  without  any  waveform  communications  between  the  tasks.  This 
raises  the  question  of  whether  the  extra  parallelism  of  Gauss-Jacobi  is  sufficient  to 


overcome  the  penalty  of  requiring  a  larger  number  of  iterations. 

By  combining  either  the  full  window  technique  or  time  point  pipelining  with 
either  the  Gauss-Seidel  or  Gauss-Jacobi  relaxation  method,  one  of  four  different  parallel 
waveform  relaxation  algorithms  is  obtained.  The  four  algorithms  and  their  principal 
tradeoffs  are  summarized  in  Fig.  2.1. 

23  Task  Graphs 

Task  graphs  are  useful  tools  in  studying  and  implementing  parallel  algorithms.  A 
task  graph  is  a  directed  graph  in  which  the  vertices  represent  tasks  and  the  arcs 
represent  precedence  constraints  [Hwa84].  An  arc  from  vertex  t  to  vertex  j  indicates 
that  task  i  must  finish  before  task  j  is  allowed  to  begin  executing;  i  is  said  to  be  a 
predecessor  of  j  and  j  is  a  successor  of  i . 

Waveform  relaxation  task  graphs  are  closely  related  to  the  subcircuit  interconnec¬ 
tion  structure,  which  is  conveniently  represented  by  a  subcircuit  graph.  For  a  given  cir¬ 
cuit  partitioned  into  n  subcircuits,  in  which  the  subcircuits  are  numbered  from  1  ton 
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Figure  2.1.  Parallel  waveform  relaxation  algorithms. 


reflecting  a  given  ordering  of  the  subcircuits,  the  subcircuit  graph  is  defined  as  follows: 

Definition  2.1.  The  subcircuit  graph  G  contains  one  vertex  for  each  subcircuit,  labeled 
with  the  subcircuit  number.  An  arc  exists  from  vertex  i  to  vertex  j  if  and  only  if  a  node 
equation  in  subcircuit  j  depends  on  a  node  voltage  which  belongs  to  subcircuit  i. 

For  the  Gauss-Seidel  method,  waveforms  are  propagated  differently  depending  on 
whether  the  waveform  is  computed  in  a  subcircuit  that  has  a  higher  or  lower  number 
than  the  destination  subcircuit,  as  indicated  in  (2.4).  This  distinction  motivates  the 
next  definition. 

Definition  2.2.  An  arc  in  G  from  i  to  j  is  called  a  feedforward  arc  if  i  <  j,  and  is  called 
a  feedback  arc  if  i  >  j. 

This  terminology  agrees  with  the  standard  circuit  concept  of  feedback  provided  the  ord> 
ering  of  subcircuits  conforms  to  the  predominant  direction  of  signal  flow  in  the  circuit. 

23.1  Full  window  technique 

A  waveform  relaxation  task  graph  for  the  full  window  technique  depends  on 
which  relaxation  method  is  employed  and  on  the  number  of  iterations  to  be  performed, 
as  reflected  in  the  notation  introduced  in  the  next  definition. 

Definition  23.  For  a  given  G,  the  task  graphs  for  m  iterations  of  the  Gauss- Jacobi  and 
Gauss-Seidel  methods  using  the  full  window  technique  are  denoted  as  ? Gl.m  and  T GS.m  • 
respectively.  Each  vertex  represents  a  subcircuit  evaluation  task  and  is  labeled  with  an 
ordered  pair  (k.  i  ),  where  k  is  the  iteration  number  and  i  is  the  subcircuit  number.  An 
arc  exists  from  (k  ,.i  t)  to  ( k2,i2 )  if  and  only  if  ( k2,i2 )  requires  an  input  waveform  from 
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The  task  graphs  can  be  constructed  from  G  .  based  on  the  waveform  communica¬ 
tion  rules  for  the  applicable  relaxation  method.  Each  vertex  i  of  G  maps  to  m  vertices 
in  the  task  graph,  labeled  (A.i ).  for  A  €{1.2.  •••.»»}.  Each  of  these  task  graph  ver¬ 
tices  is  said  to  be  an  instance  of  vertex  i  in  G .  For  the  Gauss-Jacobi  task  graph,  each 
arc  in  G  from  a  vertex  t  to  a  vertex  j  maps  to  m— 1  arcs  in  T0Jm .  from  (k.i)  to 
(A+l.  j ).  for  k  €{1.2,  •  •  • .  m  — 1>.  These  arcs  are  said  to  be  instances  of  the  arc  in  G 
from  i  to  j .  For  the  Gauss-Seidel  task  graph,  each  feedback  arc  in  G  from  i  to  j  maps 
to  m  —1  instances  in  Tas  m  .  from  (A.i  )  to  (k  +1,  ;  ).  for  A  €{1.2.  •  ■  •  .m  — l};  and  each 
feedforward  arc  from  i  to  j  maps  to  m  instances  in  ^GS.  m  •  from  (k.i )  to  (A.  j  ).  for 
A  €{1.2.  Figure  2.2  shows  a  sample  subcircuit  graph  and  corresponding 

Gauss-Seidel  and  Gauss-Jacobi  task  graphs  for  the  case  m  =2. 

The  instance  relationships  introduced  in  the  construction  of  the  task  graphs  define 
mappings  between  elements  of  the  subcircuit  graph  and  the  task  graphs.  These  relation¬ 
ships  are  useful  in  proving  theorems  which  relate  properties  of  parallel  relaxation 
methods  to  properties  of  G .  If  T  is  one  of  the  task  graphs,  then  observe  that  any 
directed  path  in  T  maps  to  a  directed  walk  in  G .  such  that  the  j‘h  vertex  (or  arc)  of 
the  path  in  T  is  an  instance  of  the  j,h  vertex  (or  arc)  of  the  walk  in  G .  An  arc  in  T  is 
referred  to  as  a  feedback  or  feedforward  arc.  based  on  whether  it  is  an  instance  of  a 
feedback  or  feedforward  arc,  respectively.  To  simplify  subsequent  discussions  of  the 
subcircuit  and  task  graphs,  the  terms  path,  walk,  and  cycle  will  be  used  to  refer  to 
directed  paths,  walks,  and  cycles. 

23.2  Time  point  pipelining 

Task  graphs  for  the  time  point  pipelining  method  can  be  constructed  in  which  each 
vertex  represents  the  computation  of  a  single  time  point  in  a  single  subcircuit  on  a  single 
iteration. 
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Iteration  2 
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Figure  2.2.  Subcircuit  graph  and  related  task  graphs:  (a)  G  ;  (b)  ^gs.2*  (c)  Tgj  2. 
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Definition  2 A.  For  a  given  circuit  with  a  given  G,  the  unaugmented  task  graphs  for  m 
iterations  of  the  Gauss-Jacobi  and  Gauss-Seidel  methods  using  time  point  pipelining  are 
denoted  as  Ttpp^-j  m  and  TTPP.GSm  ,  respectively.  Each  vertex  represents  a  time  point 
subtask  which  consists  of  the  evaluation  of  a  time  point  in  a  subcircuit  on  a  relaxation 
iteration,  and  is  labeled  (k.i.t),  where  k  is  the  iteration  number,  i  is  the  subcircuit 
number,  and  t  is  the  value  of  the  time  variable. 

A  time  point  pipelining  task  graph  Trpp  can  be  constructed  from  the  corresponding 
full  window  technique  task  graph  T  and  from  knowledge  of  the  number  and  locations 
of  the  time  points.  When  a  time  point  at  some  time  t  is  computed  in  subcircuit  evalua¬ 
tion  task  (k.i ).  its  time  value  is  initially  determined  based  on  the  previous  time  step 
and  the  estimated  local  truncation  error  at  that  step.  Let  t^  (k.i.t)  be  this  initial 
choice  of  the  time  value.  If  the  number  of  Newton  iterations  is  excessive  or  if  the  local 
truncation  error  is  excessive  for  the  new  time  point  at  time  tMil  (k.i.t ).  then  the  time 
step  is  reduced  repeatedly  until  the  Newton  iteration  count  and  the  truncation  error  are 
acceptable.  The  final  value  of  the  time  variable  t  for  the  time  point  is  less  than  or 
equal  to  t^  (k.i.t ).  Consequently,  when  computing  the  time  point  at  time  t .  it  is 
necessary  for  the  input  waveforms  to  be  available  through  time  t^  (k.  i.t ). 

The  construction  of  Ttpp  begins  by  defining  vertices  for  all  the  time  point  subtasks 
and  labeling  them  as  specified  in  Definition  2.4.  Consider  any  arc  in  T  from  a  task 
(k  j.  i  j)  to  ( k2,i2 ).  For  each  time  point  t2  in  ( k2,i2 ).  let  f ,  be  the  smallest  time  point 
in  (*  j.  i  j)  such  that  f ,  ^  t fcjf  ( k  2,i2.t  2).  Then  there  is  an  arc  in  Trpp  from  ( k  ij.  t ,)  to 

(k  j.  *  2’  ^  2^' 

Unlike  the  task  graphs  for  the  full  window  technique,  the  time  point  pipelining 
task  graphs  cannot  be  constructed  solely  on  the  basis  of  G  and  m .  The  number  of  time 
points  to  be  computed  and  their  locations  on  the  time  axis  depend  on  the  signal  activity 


in  the  subcircuits,  and  this  information  is  not  known  prior  to  simulating  the  circuit. 
The  time  point  pipelining  task  graphs  may  be  constructed  after  the  simulation  is  com¬ 
pleted.  or  they  can  be  constructed  dynamically  during  the  simulation.  For  this  reason, 
the  time  point  pipelining  task  graphs  play  a  less  important  role  than  the  full  window 
technique  task  graphs  in  presimulation  studies  of  parallelism.  Due  to  the  relationship 
between  the  time  point  pipelining  and  full  window  technique  task  graphs,  the  simpler 
full  window  technique  task  graphs  can  be  used  in  the  study  and  implementation  of 
time  point  pipelining  by  recognizing  implicitly  that  each  task  consists  of  a  sequence  of 
time  point  subtasks  with  suitable  precedence  constraints. 

2A  First  Order  Analysis  of  Parallel  Algorithms 

Based  on  the  task  graphs  and  some  simplifying  assumptions,  first-order  measures 
of  the  parallelism  of  different  relaxation  methods  and  different  parallelization  strategies 
can  be  derived.  The  first-order  estimates  are  not  accurate  measures  of  the  parallel  pro¬ 
cessing  speedups  that  can  be  obtained  in  practice.  Nonetheless,  these  estimates  provide 
basic  insights  into  the  different  parallel  algorithms  and  the  fundamental  tradeoffs 
involved.  More  accurate  models  of  parallelism  are  developed  in  Chapter  4. 

2.4.1  Full  window  technique 

The  following  simplifying  assumptions  apply  to  the  first-order  analysis  of  the  full 
window  technique: 

(a)  Each  subcircuit  evaluation  task  requires  exactly  one  unit  of  computation  time. 

(b)  There  is  a  sufficient  number  of  processors  such  that  each  task  can  begin  executing 
immediately  after  all  of  its  predecessors  in  the  task  graph  have  finished  executing. 

(c)  There  is  no  parallel  processing  overhead  due  to  task  scheduling,  data  communica¬ 
tions.  or  contention  for  shared  resources. 
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Since  each  task  is  assumed  to  require  one  unit  of  time,  and  since  each  task  is 
assumed  to  begin  immediately  after  its  predecessors  terminate,  the  task  graph  can  be 
partitioned  into  levels  such  that  the  tasks  in  level  1  are  those  which  are  active  during 
processor  time  interval  (1—1.1].  Figure  2.3  shows  the  levelizations  of  the  task  graphs 
of  Fig  2.2. 
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Figure  2.3.  Levelized  task  graphs:  (a)  Tas  2:  (b)  T0J  2. 
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For  a  task  graph  T ,  the  time  required  to  execute  all  the  tasks  on  a  multiprocessor 
with  a  sufficient  number  of  processors  is  equal  to  the  depth  of  the  graph.  D  ( T  ).  which 
is  defined  as  the  number  of  vertices  in  a  longest  directed  path,  or  equivalently  the 
number  of  levels  in  the  levelized  graph.  The  average  number  of  tasks  executed  at  each 
step  is  the  average  width  of  the  graph.  W  ( T  )*|V  (T  )J ID  (T  ).  where  |V  (T  )j  is  the  total 
number  of  vertices  in  the  graph.  For  the  example  in  Fig.  2.3.  D(rGS  2)*7.  D(TC7  2)*2. 
W&cs  ^1-7-  and  W(rc/  2)«6. 

The  superior  parallelism  of  Gauss-Jacobi  over  Gauss-Seidel  is  apparent  from  the 
task  graphs.  In  general,  m  iterations  of  Gauss-Jacobi  require  only  m  steps,  since  within 
each  iteration  all  the  tasks  can  be  executed  concurrently.  Except  for  some  degenerate 
cases  mentioned  in  Chapter  3.  D  (Tc;  m  )*m  and  W(7*6/m)*«.  In  contrast,  the  tasks 
of  any  single  iteration  of  Gauss-Seidel  generally  have  data  interdependencies  such  that 
they  cannot  all  be  executed  concurrently.  Normally.  D(T0Sm)»D(T0Jm)  and 
W(T0S'm)«W(T0;ml 

The  superior  parallelism  of  Gauss-Jacobi  is  achieved  at  the  expense  of  an  increase 
in  the  number  of  iterations  required  to  converge.  In  the  Gauss-Seidel  case  a  signal  can 
propagate  through  an  entire  forward  signal  path  in  a  single  iteration.  In  the  Gauss- 
Jacobi  case,  however,  signals  propagate  at  a  rate  of  one  subcircuit  per  iteration.  This 
suggests  that  circuits  with  long  signal  paths  will  require  a  large,  perhaps  prohibitive, 
number  of  Gauss-Jacobi  iterations.  However,  in  a  given  time  window,  long  signal  paths 
are  frequently  broken  into  smaller  subpaths  by  logic  gates,  pass  transistors,  or  clocked 
devices  which  block  the  signal  flow.  The  number  of  Gauss-Jacobi  iterations  is  then 
determined  by  the  number  of  subcircuits  in  each  subpath  and  by  feedback  between  sub- 
circuits.  Even  if  a  long  path  is  present,  the  Gauss-Jacobi  iterations  may  not  be  excessive 
if  the  subpaths  are  short. 


Consider  the  subcircuit  graph  and  related  task  graphs  in  Fig.  2.4.  The  circuit  con¬ 
sists  of  a  single  long  path  with  no  feedback.  The  Gauss-Seidel  method  produces  the 
exact  solution  in  1  iteration,  and  the  Gauss-Jacobi  method  requires  exactly  n  iterations. 
However,  both  methods  require  exactly  n  units  of  compulation  time.  The  Gauss-Jacobi 


)t  O  O  D  'b  T>  O  '  .  .  .  T> 
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Figure  2.4.  Long  signal  path  example,  (a)  G :  (b)  Tcs  (c)  TGJ  „  . 
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method  requires  more  processors,  but  it  produces  the  same  result  in  the  same  amount  of 
time  in  this  example. 

Suppose  that  a  logic  gate  or  pass  transistor  in  subcircuit  n/2  is  in  a  state 
throughout  some  particular  time  window  such  that  signal  flow  from  subcircuit  n  / 2  to 
n/2 4-1  is  blocked.  Since  the  subcircuit  graph  is  unchanged,  the  Gauss-Seidel  method 
will  still  require  1  iteration  and  n  units  of  time.  The  Gauss-Jacobi  method  will  obtain 
the  exact  solution  in  n  fl  iterations,  requiring  only  n  fl  units  of  time,  because  the  long¬ 
est  effective  signal  path  has  length  n  fl.  Gauss-Jacobi  effectively  solves  the  two  isolated 
portions  of  the  circuit  concurrently. 

In  digital  circuits,  signal  paths  are  frequently  blocked  by  clocked  devices  during 
certain  intervals  of  time.  The  automatic  windowing  algorithm  of  RELAX2.3  typically 
chooses  windows  which  are  smaller  than  one  clock  period.  Therefore,  in  a  typical  win¬ 
dow.  the  effective  signal  path  lengths  are  much  shorter  than  the  paths  in  the  subcircuit 
graph.  The  Gauss-Jacobi  method  automatically  exploits  the  parallelism  between  the 
essentially  disconnected  portions  of  the  circuit:  the  Gauss-Seidel  method  does  not. 

In  the  segmented  waveform  relaxation  method  [Dum57].  the  time  axis  is  parti¬ 
tioned  into  time  segments,  whose  boundaries  are  explicitly  synchronised  with  clock 
transitions.  Portions  of  the  circuit  which  are  disconnected  by  clocked  logic  elements 
during  a  time  segment  are  recognised,  and  the  Gauss-Seidel  method  is  applied  separately 
and  concurrently  to  the  disconnected  portions  for  the  first  iteration  of  the  relaxation. 
This  approach  exploits  the  parallelism  between  different  parts  of  the  circuit  while 
retaining  the  Gauss-Seidel  ordering  within  the  isolated  circuit  portions.  However,  the 
program  is  required  to  recognize  the  clocked  elements  and  derive  the  resulting  effective 
circuit  partitioning.  Thus,  the  method  is  less  general  and  more  complicated  than 
Gauss-Jacobi.  In  those  cases  where  it  is  applicable,  it  may  be  more  effective  than 
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Gauss-Jacobi  when  the  number  of  processors  is  limited,  because  it  avoids  some  of  the 
redundant  computations  which  tend  to  occur  in  Gauss-Jacobi  on  the  iterations  before  a 
signal  reaches  a  particular  subcircuit. 

The  preceding  discussion  gives  reasons  why  the  Gauss-Jacobi  method  may  be 
preferable  to  Gauss-Seidel  on  parallel  processors.  Further  theoretical  and  experimental 
evidence  will  be  presented  in  subsequent  chapters  to  show  that  when  a  sufficient 
number  of  processors  is  available,  parallel  Gauss-Jacobi  is  faster  than  parallel  Gauss- 
Seidel. 

In  preparation  for  comparing  the  full  window  technique  with  time  point  pipelin¬ 
ing.  the  following  assumption  is  added  to  those  previously  introduced  for  the  first- 
order  analysis  of  the  full  window  technique: 

(d)  The  number  of  tasks  in  each  level  of  the  task  graph  is  constant. 

The  number  of  active  tasks  can  then  be  plotted  as  a  function  of  time,  as  in  Fig.  2.5(a). 
Since  a  different  number  of  iterations  is  generally  required  for  the  Gauss-Jacobi  and 
Gauss-Seidel  methods,  the  symbols  and  i\iS  are  used  to  represent  the  number  of 
iterations  required  for  each  method  to  converge,  where  typically  >mos .  The 
curves  emphasize  the  fact  that  even  though  Gauss-Jacobi  requires  more  computations  as 
represented  by  the  area  under  the  curve,  its  greater  parallelism  can  lead  to  a  smaller 
overall  computation  time.  The  number  of  processors  required  to  achieve  the  fastest 
possible  computation  time  for  Gauss-Jacobi  is  greater  than  that  needed  for  Gauss-Seidel. 
as  indicated  by  the  heights  of  the  curves. 


2-4.2  Time  point  pipelining 

The  first-order  analysis  of  time  point  pipelining  is  based  on  the  full  window  task 
graphs.  The  tasks  are  implicitly  divided  into  subtasks,  each  consisting  of  the  evaluation 


or 


(b) 

Figure  2.5.  Active  tasks  vs.  time  for  (a)  full  window  technique  and  (b)  time  point 
pipelining,  based  on  simplified  parallelism  model.  Dos  s D  (70J  m  )  and 

W„mW(rMm  ). 


of  a  subcircuit  on  an  iteration  at  a  time  point.  Subtasks  are  represented  by  ordered  tri¬ 
ples  of  the  form  ( k.i.t  ),  such  that  (k.i  )  is  the  subcircuit  evaluation  task  which  con¬ 
tains  the  subtask,  and  t  is  the  value  of  the  time  variable  at  the  time  point.  The  follow¬ 
ing  simplifying  assumptions  apply  to  the  first-order  analysis  of  time  point  pipelining,  in 
addition  to  the  assumptions  used  in  the  full  window  case: 

(e)  Each  subcircuit  evaluation  task  contains  p  time  point  subtasks. 

(f)  Each  subtask  requires  the  same  amount  of  computation  time.  1  Ip  time  units. 

(g)  The  time  point  locations  on  the  time  axis  are  identical  in  each  task. 

(h)  No  time  step  reductions  occur  due  to  excessive  integration  truncation  error  or 
excessive  Newton  iterations. 

(i)  There  is  a  sufficient  number  of  processors  such  that  each  subtask  (k.  i.t  )  can  begin 
execution  immediately  after  all  predecessors  of  (k.i)  have  finished  computing  the 
time  point  at  time  t .  and  (k.  i  )  has  computed  the  time  point  preceding  t . 

These  assumptions  are  generally  not  accurate,  but  are  useful  to  demonstrate  the  basic 
properties  of  the  algorithms.  Assumptions  (e)  and  (g)  do  not  hold  exactly  in  practice 
because  the  time  points  are  chosen  independently  in  each  subcircuit  based  on  the  signal 
activity  within  the  subcircuit.  Assumption  (f)  does  not  hold  exactly  because  subcir¬ 
cuits  of  different  si2es  have  subtasks  requiring  different  compuution  times.  Generally 
the  assumptions  tend  to  result  in  overestimates  of  parallelism. 

When  time  point  pipelining  was  originally  introduced  it  was  used  in  conjunction 
with  the  Gauss-Seidei  method  as  a  way  to  increase  the  parallelism  without  giving  up 
the  fast  convergence  properties  of  Gauss-Seidei.  However,  the  technique  is  equally 
applicable  to  any  relaxation  method  that  can  be  represented  by  a  full  window  technique 
task  graph,  including  Gauss- Jacobi 


Let  T  be  any  full  window  technique  task  graph,  and  consider  the  operation  of  the 
time  point  pipelining  algorithm  applied  to  T  as  the  parallel  computation  time  Z 
progresses,  using  the  stated  simplifying  assumptions.  Let  e  (Z )  be  the  number  of  active 
tasks  in  processor  time  interval  ((Z— l)/p,  Z/p].  Initially,  each  task  in  level  1  of  T  can 
compute  its  first  time  point,  and  therefore  c  (1)=W  (r ).  assuming  the  task  graph  has 
constant  width.  Once  these  time  point  computations  are  completed,  the  second  time 
points  of  the  level  1  tasks  can  be  computed  concurrently  with  the  first  time  points  of 
the  level  2  tasks,  resulting  in  c(2)*2W(r).  The  parallelism  continues  to  increase  at  a 
rate  of  W  (T  )  until  either  all  tasks  in  the  task  graph  are  active  simultaneously,  or  until 
the  last  time  point  is  computed  in  the  level  1  tasks.  Consequently,  the  peak  parallelism 
is 


max{c  (Z )}  »  minln/n  .  W(T  )p } .  (2.6) 

After  the  last  time  point  is  computed  in  the  level  1  tasks  and  after  the  first  time  point 
is  computed  in  the  tasks  of  the  final  level,  the  parallelism  decays  at  a  rate  of  W  (r ). 
because  at  each  step  another  level  completes  its  last  time  point  and  no  new  levels  are 
started.  The  final  completion  time  is  given  by  U>(D+p—  l]/p .  because  after  D  (T ) 
steps  the  bottom  level  tasks  complete  their  first  time  points,  and  after  p  —1  more  steps 
they  complete  their  last  time  points,  where  each  step  requires  1  Ip  time  units.  Combin¬ 
ing  these  results,  a  formula  for  c  (Z )  is  obtained  for  all  integers  Z : 
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(2.7) 


where  D  and  W  are  the  depth  and  width  of  the  full  window  technique  task  graph. 


Generic  plots  of  the  number  of  active  tasks  as  a  function  of  parallel  processor 
time,  based  on  (2.7),  are  shown  in  Fig.  2.5(b)  (p.  25)  for  Gauss-Jacobi  and  Gauss-Seidel. 
The  substitution  W(TSJ  m  )-n  has  been  used  in  the  figure  to  emphasize  the  fact  that 
typically  W(T6J  mc/)=n  »  W(Tas  Note  that  the  parallelism  at  each  point  using 

time  point  pipelining  is  at  least  as  large  as  the  full  window  parallelism  using  the  same 
relaxation  method.  Also  note  that  both  the  peak  parallelism  and  the  rate  of  growth  of 
parallelism  are  greater  for  Gauss-Jacobi  time  point  pipelining  than  for  Gauss-Seidel 
time  point  pipelining.  Finally,  the  number  of  processors  required  to  use  all  the  avail¬ 
able  parallelism  of  Gauss-Jacobi  time  point  pipelining  can  be  quite  large  for  a  large  cir¬ 
cuit  if  large  windows  are  used. 

25  Augmented  Task  Graphs 

The  task  graphs  TCJ  m  and  T0S  m  account  for  the  evaluation  of  subcircuits  and  the 
data  communications  between  subcircuit  evaluation  tasks.  Although  these  are  the  most 
important  tasks  and  data  dependencies,  they  are  not  the  only  ones.  Computations  must 
be  performed  to  determine  when  the  iterations  converge.  The  convergence  checking 
operations  can  either  be  packaged  inside  the  subcircuit  evaluation  tasks,  or  they  can  be 
treated  as  separate  tasks.  In  either  case,  the  task  graph  must  be  augmented  by  adding 
arcs  or  both  vertices  and  arcs,  to  account  for  the  additional  data  dependencies  and.  in 
the  latter  case,  the  additional  tasks. 

25.1  Separate  convergence  checking  tasks 

For  each  subcircuit  evaluation  task  (k.i )  in  window  [ta.tb],  a  convergence  check¬ 
ing  task  {k.i  )ce  can  be  defined,  which  executes  the  following  algorithm: 

Algorithm  2.2.  Convergence  Checking  Task:  {k ,  i  )„ 

if  (v/*  \t )  matches  v/*  -1)(f ),  for  all  t  €[fa.  tb  ],  within  tolerance)  I 
unconvk  *-unconvk  —1 


if  ( unconvk  =0)  signal  global  convergence 

I 

The  counter  unconvk  represents  the  number  of  unconverged  subcircuits  in  iteration  k  , 
and  is  initialled  to  n .  The  decrementing  and  testing  of  the  counter  must  be  performed 
in  a  critical  section  to  insure  the  integrity  of  the  count  values  since  different  conver¬ 
gence  checking  tasks  may  execute  concurrently.  When  global  convergence  is  signaled, 
then  any  remaining  incomplete  tasks  in  the  task  graph  are  superfluous  since  they  are  for 
iterations  greater  than  the  iteration  in  which  convergence  was  obtained. 

The  addition  of  the  convergence  checking  tasks  and  the  accompanying  data  depen¬ 
dencies  are  reflected  in  the  augmented  task  graphs  which  are  defined  as  follows: 

Definition  2.5.  For  any  task  graph  Ta}  m  or  TOS  m  ,  the  corresponding  augmented  task 
graph  TGj.m  w  7 GS.m  is  constructed  by  adding  a  task  (k.  i  )„  for  each  task  (*.  i  ),  and 
adding  arcs  from  ( k.i  )  to  (k.i  )<x  for  all  k  €{l....,m  },  i  €{l,...,n  },  and  adding  arcs  from 
Ok  — 1. a  )  to  (*.»)„  for  all  k  €{2.....m),i  €{l.....n}. 

An  important  feature  of  these  augmented  task  graphs  is  that  there  are  no  new  arcs 
terminating  at  subcircuit  evaluation  tasks.  Consequently,  the  parallel  completion  time 
of  the  subcircuit  evaluation  tasks  is  not  affected  by  the  additions  to  the  task  graph. 
Convergence  checking  tasks  can  be  executed  concurrently  with  subcircuit  evaluation 
tasks,  except  for  the  convergence  checking  task  corresponding  to  the  last  subcircuit 
evaluated  in  the  converging  iteration.  Therefore,  the  parallel  completion  time  for  the 
augmented  graph  is  greater  than  the  original  graph  by  the  time  of  one  convergence 
checking  task,  which  is  much  smaller  than  a  subcircuit  evaluation  task.  For  this  reason, 
the  use  of  the  unaugmented  task  graphs  is  justified  in  the  study  of  parallel  waveform 
relaxation,  even  though  these  graphs  do  not  explicitly  account  for  convergence  checking. 
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2JS.2  Convergence  checking  in  subcircuit  evaluation  tasks 

An  alternative  approach  to  handling  convergence  checking  is  to  treat  the  conver¬ 
gence  checking  task  ( k.i )„  as  part  of  the  subcircuit  evaluation  task  (k.  i  ).  This  results 
in  a  simpler  implementation  with  lower  overhead  for  task  scheduling.  However,  extra 
arcs  must  be  added  to  the  task  graph  to  assure  that  the  waveforms  are  available  from 
(k—  l.t )  as  needed  by  the  convergence  checker  inside  (k.i ).  A  different  type  of  aug¬ 
mented  task  graph  results  from  this  treatment  of  convergence  checking,  as  given  in  the 
following  definition. 

Definition  2.6.  For  any  task  graph  TC}  m  or  Tas  m  ,  the  corresponding  augmented  task 
graph  Taj  m  or  TGS  m  is  constructed  by  adding  arcs  from  (k.i)  to  (ifc+l.i)  for  all 
k  €{l....sn—  1},  i  €{l,....n  }. 

The  added  constraints  produce  an  additional  simplification  in  the  implementation 
of  the  parallel  algorithm,  because  the  constraints  prevent  any  subcircuit  from  being 
active  in  more  than  one  iteration  at  any  moment  of  time.  For  example,  tasks  (1.1)  and 
(2.  1)  are  not  allowed  to  execute  concurrently,  because  these  are  two  instances  of  the 
same  subcircuit  in  different  iterations.  However,  different  subcircuits  are  allowed  to  be 
active  in  different  iterations  concurrently,  to  the  extent  permitted  by  the  original  task 
graph.  For  example,  in  the  task  graph  of  Fig.  2.2(a).  tasks  (1.  4)  and  (2,  1)  may  be  exe¬ 
cuted  concurrently,  because  there  is  no  path  connecting  them.  The  fact  that  only  one 
instance  of  each  subcircuit  can  be  active  at  a  time  means  that  some  data  structures  can 
be  allocated  once  for  each  subcircuit  and  used  by  all  instances  without  conflict. 

Adding  arcs  to  the  task  graph  can  reduce  the  degree  of  parallelism  by  requiring 
some  tasks  to  be  executed  serially,  which  previously  could  have  been  executed  in  paral¬ 
lel.  The  degree  by  which  the  parallelism  is  reduced  by  the  additional  arcs  is  investi¬ 
gated  in  the  following  theorems. 
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Definition  2.7.  For  a  given  C,  let  0CJ  (s  )  and  $GS  (.s  )  be  the  maximum  number  of 
instances  of  subcircuit  s  that  can  be  active  concurrently  using  the  full  window  technique 


based  on  the  unaugmented  task  graphs  TGj.  oo  aTU*  Tos.  tx>,  respectively,  where  the  task  exe¬ 
cution  times  and  number  of  processors  are  arbitrary. 

Definition  2JJ.  For  any  directed  graph  H  in  which  each  arc  is  designated  as  a  feedback 
or  feedforward  arc,  let  f  (H)  be  the  number  of  feedforward  arcs  in  H  and  let  b(H)  be 
the  number  of  feedback  arcs  in  H  . 


Theorem  2.1.  If  s  is  a  vertex  of  G  then 


and 


PGr(s ) 


/  (C  ,)+i  (C  j)  ,  if  s  is  in  a  cycle 
oo  .  otherwise 


(2.8) 


Pgs  (j  )  = 


MC2) 


where  Cx  is  a  cycle  of  G  which  minimizes 
minimizes  b  (C  2). 


•  if  s  is  in  a  cycle  ^ 

.  otherwise 

/  (Cj)+h(CI)  and  C2is  a  cycle  of  G  which 


To  prove  the  theorem,  first  consider  the  Gauss-Jacobi  case  and  suppose  5  is  in  a 
cycle.  For  each  k  >0  there  is  a  path  from  ( k.s  )  to  (k  if  (C^+tCCj).  s  )  in  Tgj.vo- 
corresponding  to  one  traversal  of  Cj.  Therefore,  {k.s)  and  (k+i[f  (C  x)+6  (C  s  ) 

cannot  be  active  simultaneously,  for  any  i  € {1, 2,  •  ■  •  }.  If  (k  j,  s  ) . ( ky.s  )  are  active 

simultaneously,  then  k  j  mod  (/  (C  ^+6  (C  t)) . k  y  mod  if  (C  t)+h  (C ,))  must  be  dis¬ 

tinct.  Hence  y </  (C  j)+ft  (C  which  implies 


0G/(s)*f  (Cj+bCCj.  (2.10) 

Note  that  the  first  /  (C^+MCj)  instances  of  s  are  mutually  independent,  since  any 

directed  path  connecting  two  of  them  would  violate  the  fact  that  Ct  minimizes 

/  (CiHHCj).  If  these  independent  tasks  have  arbitrarily  large  computation  times 


compared  to  all  other  tasks,  they  will  be  active  concurrently,  given  enough  processors. 
Therefore, 

(CjH&CCi).  (2.11) 

Combining  (2.10)  and  (2.11)  verifies  the  first  part  of  (2.8).  If  s  is  not  in  a  cycle,  all 

instances  of  s  are  independent,  and  an  arbitrary  number  of  them  can  be  active  con¬ 
currently.  thus  completing  the  proof  of  (2.8).  The  Gauss-Seidel  result  is  obtained  by 
noting  that  for  each  k  >0  there  is  a  path  from  (k.  s )  to  (k  +b (C2).  s)  in  w 
corresponding  to  one  traversal  of  C  2.  and  then  following  the  same  argument  as  for  the 
Gauss-Jacobi  case  to  complete  the  proof. 

In  circuit  simulations  where  a  high  degree  of  accuracy  is  required,  accurate  models 
are  used  which  invariably  have  bidirectional  coupling  between  each  pair  of  nodes  that 
are  connected  through  a  circuit  element.  Even  MOS  transistor  gate  terminals,  which  are 
nearly  unidirectional,  are  subject  to  capacitive  feedback  from  the  source  and  drain  ter¬ 
minals.  In  terms  of  the  subcircuit  graph,  a  circuit  which  has  only  bidirectional  coupling 
will  have  an  arc  from  vertex  j  to  vertex  i  for  each  arc  from  i  to  / .  The  number  of 
possible  concurrent  instances  of  subcircuits  is  severely  restricted  for  circuits  with 
bidirectional  coupling,  as  evidenced  by  the  next  theorem. 

Theorem  2,2.  If  all  coupling  in  C  is  bidirectional,  and  if  C  contains  no  isolated  vertices, 
then  fiCJ  ( s  )=2  and  0GS (r )  =  1 . 

If  s  is  a  vertex  of  G  .  then  there  exists  a  vertex  /  such  that  there  are  arcs  from  s  to  j 
and  from  ;  tor.  These  two  arcs  constitute  a  cycle  C  containing  vertex  s  .  such  that 
/  (C  )—b  (C  )=1.  Since  any  cycle  must  contain  at  least  one  feedforward  arc  and  one 
feedback  arc.  C  minimizes  the  expressions  /  (C  )+fi  (C  )  and  b(C ).  The  proof  is  com¬ 
pleted  by  applying  Theorem  2.1. 


The  preceding  theorem  indicates  that  the  extra  arcs  of  the  augmented  graph  have 
absolutely  no  effect  on  the  parallelism  of  the  Gauss-Seidel  method  for  bidirectionally 
coupled  circuits.  But  in  the  Gauss-Jacobi  case,  a  factor  of  up  to  2  may  be  sacrificed  in 
parallelism.  In  the  definition  of  0G/ .  the  computation  times  of  the  tasks  are  unspecified, 
and  consequently.  fiGJ  represents  the  worst  possible  combination  of  task  times.  It  is 
readily  apparent  that  no  concurrently  active  instances  of  the  same  subcircuit  are  possi¬ 
ble  in  parallel  Gauss-Jacobi  if  all  the  tasks  in  the  task  graph  require  the  same  amount  of 
computation  time  and  if  the  number  of  processors  is  greater  than  or  equal  to  the 
number  of  subcircuits.  In  this  case  all  tasks  in  iteration  *  are  active  in  time  interval  * . 
and  no  overlapping  of  iterations  occurs.  In  this  case  the  extra  arcs  of  the  augmented 
graph  have  no  practical  effect  on  the  parallelism.  This  observation  motivates  the  fol- 
lowing  theorem  which  relates  the  loss  of  Gauss-Jacobi  parallelism  in  the  T  augmented 
graph  to  the  degree  of  mismatch  in  the  task  sizes.  First  the  concepts  of  task  graph  depth 
and  width  are  generalized  for  graphs  in  which  the  vertices  have  weights  representing 
the  compulation  times  of  the  tasks. 

Definition  2.9.  Let  Dw  (T )  and  Wv  (T )  represent  the  weighted  depth  and  weighted 
average  width  of  directed  graph  T  with  weighted  vertices,  where  Dw  {T  )  is  the  sum  of 
the  vertex  weights  in  a  path  which  has  a  maximum  vertex  weight  sum,  and  (T )  is 
equal  to  the  sum  of  all  vertex  weights  divided  by  Dw  (T  ). 

Then  Dw  is  the  parallel  completion  time  and  Ww  is  the  average  parallelism,  or  the  aver¬ 
age  number  of  active  processors,  assuming  an  unlimited  supply  of  processors. 


Theorem  23.  If  all  of  the  following  conditions  hold, 

(a)  G  has  only  bidirectional  coupling  and  no  isolated  vertices ; 
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(b)  each  instance  of  subcircuit  i  requires  computation  time  wi .  for  each  i  €{  1 }; 

(c)  j  is  the  subcircuit  with  the  maximum  computation  time  w;  *w  and 

(d)  k  is  the  subcircuit  adjacent  to  j  such  that  no  other  subcircuit  adjacent  to  j  has  a 
larger  computation  time; 

then 

Dw  (.Tqj  m  )  2w 

-  4  - .  (2.12) 

Dw(TG;.m) 

Each  longest  path  in  ^ GJ.m  and  each  path  of  maximum  weight  in  ^OJ.m  has  exactly  m 
vertices,  one  in  each  iteration.  Clearly  the  path  with  the  largest  weight  in  the  aug¬ 
mented  graph  contains  all  the  instances  of  the  subcircuit  with  the  largest  weight.  Hence 

A. (2.13) 
In  the  unaugmented  graph,  there  exists  a  path  starting  at  (1.  j )  which  alternates 

between  instances  of  subcircuits  j  and  k .  Consequently. 


datg 


Since  wmMX^wk ,  it  follows  that 


,)*  J  ", 


m 

ih 


(2.14) 


D~  (TGJ  m  )  >  Y  ("m«+"t  >• 


(2.15) 


Combining  (2.13)  and  (2.15)  produces  the  inequality  in  (2.12).  completing  the  proof. 


Theorem  2.3  implies  that  if  the  largest  subcircuit  is  adjacent  to  a  subcircuit  of 
nearly  the  same  size,  then  the  parallel  completion  time  will  not  be  significantly  affected 
by  the  augmented  arcs  of  the  task  graph.  However,  if  the  largest  subcircuit  is  adjacent 
only  to  subcircuits  which  are  much  smaller,  then  the  loss  of  parallelism  will  approach 
the  worst  case  factor  of  2. 


K  Conclusion 


Four  parallel  algorithms  for  waveform  relaxation  have  been  presented  and 
analyzed  using  a  simplified  computation  model.  The  parallel  algorithms  are  derived  by 
using  Gauas-Seidel  and  Gauss-Jacobi  relaxation  in  combination  with  the  full  window 
technique  or  the  time  point  pipelining  technique  for  exploiting  parallelism.  The  choice 
between  Gauss-Jacobi  and  Gauss-Seidel  represents  a  tradeoff  between  greater  parallel¬ 
ism  and  faster  convergence,  and  the  choice  between  time  point  pipelining  and  the  full 
window  technique  represents  a  tradeoff  between  greater  parallelism  and  lower  over¬ 
head.  Task,  graph  models  for  the  algorithms  have  been  defined  which  serve  as  the  basis 
for  analysis  and  implementation  of  the  algorithms  in  subsequent  chapters. 
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CHAPTER  3 


GAUSS-JACOBI/GAUSS-SETOEL 


3 ARISON  THEORY 


The  properties  of  the  Gauss-Jacobi  and  Gauss-Seidel  methods  for  waveform  relax¬ 
ation  are  closely  related  to  the  more  basic  Gauss- Jacobi  and  Gauss-Seidel  algorithms  for 
the  solution  of  systems  of  linear  algebraic  equations,  which  are  examined  in  this 
chapter.  The  algorithms  are  compared  in  terms  of  convergence  speed  and  parallelism. 
For  the  systems  of  equations  which  arise  at  each  time  point  in  the  solution  of  node 
equations  of  MOS  circuits,  when  the  time  step  is  sufficiently  small,  it  will  be  shown 
that  parallel  Gauss- Jacobi  is  asymptotically  faster  than  parallel  Gauss-Seidel.  when  a 
sufficiently  large  number  of  processors  is  used.  The  theorem  which  establishes  this 
result  relates  the  spectral  radii  of  the  iteration  matrices  to  the  available  parallelism  of 
the  methods.  A  formula  is  also  derived  which  compares  the  parallelism  of  the  two 
methods  terms  of  the  structural  properties  of  the  equations  being  solved. 

3.1  Gauss-Jacobi  and  Gauss-Seidel  Relaxation 


Consider  the  problem  of  solving  Ax  *6  by  relaxation,  where  x.  b  €R*  .  A  €R"  *" 
is  nonsingular,  and  the  diagonal  elements  of  A  are  nonzero.  The  i,h  equation  is  solved 
independently  for  x*  while  using  previously  computed  or  guessed  values  for  the  other 
variables.  The  update  equation  for  the  i,h  vector  element  on  the  kth  iteration  for 
Gauss- Jacobi  is  given  by 


x <* > «  _L 

x.(*_,)-  f  «..x<“,) 

*  a.  j  Lm 

X,  *  - 

;a  1  j*i 

and  for  Gauss-Seidel  by 
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|  Let  D .  L  ,  and  V  be  diagonal,  strictly  lower  triangular,  and  strictly  upper  triangular 

|  matrices,  respectively,  such  that  A  mL  +D  +U .  Then  the  Gauss- Jacobi  and  Gaum-Seidel 

iteration  matrices  are  given  by  hiCJ  m~D  ~\L  -HJ )  and  Atcs  m-iL  +D  )~lU .  respec- 

!  lively.  The  asymptotic  convergence  rates  are  related  to  pCMe7  )  and  piM0S  ).  where  p 

i 

denotes  spectral  radius,  since  these  are  the  factors  by  which  the  errors  are  reduced  on 

« 

each  iteration  for  general  initial  guesses,  as  the  iteration  count  approaches  infinity. 

The  Stein-Rosenberg  theorem  [Var62]  relates  pCMe;)  and  p(.Mas).  and  conse¬ 
quently  the  convergence  speeds  of  Gauss-Jacobi  and  Gauss-Seidel.  for  a  class  of 
matrices. 

Theorem  3.1.  Stein-Rosenberg :  If  MaJ  is  nonnegative  and  piMaj)<  1,  then 

The  condition  that  MCJ  is  nonnegative  is  equivalent  to  requiring  that  /a0  <0  for  all 
i#;.  The  condition  p(fi#C7)<l  is  necessary  and  sufficient  to  assure  convergence  of 
Gauss-Jacobi.  Matrices  arising  in  the  transient  analysis  of  MOS  circuits  satisfy  both 
these  conditions  when  the  time  step  is  sufficiently  small,  provided  that  a  capacitor  is 
present  from  each  node  to  ground,  and  the  gate-drain  and  gate-source  capacitances  are 
included  in  each  MOS  transistor  [WhiS6].  For  these  matrices,  the  Stein-Rosenberg 
theorem  implies  that  for  a  sufficiently  small  error  tolerance.  Gauss-Seidel  will  generally 
converge  in  fewer  iterations  than  Gauss-Jacobi.  However,  on  parallel  processors,  it  is 
possible  to  perform  m  iterations  of  Gauss-Jacobi  in  less  time  than  m  iterations  of 
Gauss-Seidel.  Therefore.  Theorem  3.1  does  not  indicate  which  method  will  be  faster  on 


parallel  processors. 


u 


3.2  Pirilkl  Gum  lin>H  ud  Gauaa -Seidel 

The  unknown  update  equations  for  Gauss-Jacobi  and  Gauss-Seidel.  (3.1)  and  (3.2). 
involve  the  same  computation:  in  each  case  the  irt  unknown  is  updated  by  summing 
n  -1  products  and  a  constant.  Parallelism  can  be  exploited  in  computing  the  products 
and  performing  the  summation,  but  the  parallelism  of  these  computations  will  be  the 
same  for  both  Gauss-Jacobi  and  Gauss-Seidel.  The  difference  between  Gauss-Jacobi  and 
Gauss-Seidel  that  affects  bow  much  total  parallelism  can  be  exploited  is  that  in  the  case 
of  Gauss-Jacobi  all  the  unknowns  can  be  updated  simultaneously  on  each  iteration, 
whereas  in  the  case  of  Gauss-Seidel  the  number  of  unknowns  which  can  be  updated 
simultaneously  is  limited  by  data  dependencies  between  different  unknowns  of  the 
same  iteration. 

In  order  to  examine  the  difference  between  the  two  methods,  let  (3.1)  and  (3.2)  be 
treated  as  atomic  operations  which  can  be  computed  in  one  processor  step.  Using  this 
convention,  and  assuming  that  at  least  n  processors  are  available,  one  iteration  of 
Gauss-Jacobi  takes  one  processor  step,  as  all  the  unknown  updates  can  be  performed 
simultaneously,  and  one  iteration  of  Gauss-Seidel  takes  n  processor  steps  if  A  is  full, 
as  the  »'*  unknown  must  be  updated  before  the  i  +1  update  equation  can  be  completed. 

When  A  is  sparse,  it  is  possible  to  exploit  additional  parallelism  in  Gauss-Seidel  to 
reduce  the  number  of  steps  required  for  one  iteration  to  well  below  n  .  The  sparsity  of 
A  allows  some  updates  of  a  given  iteration  to  be  done  simultaneously.  For  example,  if 
«i+,  .  *0.  then  x/* 1  can  be  updated  simultaneously  with  x/*/ .  It  is  also  possible  to 
begin  iteration  *  41  before  completing  iteration  k .  For  example,  if  ax  j  through  a ,, 
are  all  aero,  then  one  can  compute  x  ,(t  +I)  without  waiting  for  xjk  ’  through  x„(i  1  to  be 
computed  first. 


m,v  v  V  '  *  V  .l 


The  number  of  steps  required  to  compute  m  Gtuss-Seidel  or  Gauss-Jacobi  itera¬ 
tions  on  a  sparse  matrix  can  be  determined  from  an  m  -iteration  task  graph.  TCS  m  or 
ToJ.m  •  respectively.  Each  vertex  represents  the  task  of  performing  an  update  as 
specified  by  (3.1)  or  (3.2).  and  each  arc  represents  a  data  dependency.  The  graph  can  be 
constructed  based  on  the  nonzero  structure  of  A .  First  the  nonzero  structure  of  A  is 
represented  by  a  directed  graph  G  which  is  analogous  to  the  subcircuit  graph  defined 
previously.  The  vertices  of  G  are  numbered  1  to  a.  corresponding  to  the  vector  ele¬ 
ment  numbers,  and  each  off-diagonal  element  aiJt  is  represented  by  an  arc  from  i  to  /  . 
For  example.  Fig.  3.1  shows  the  nonzero  structure  of  a  matrix  and  its  corresponding 
graph  G  .  The  task  graphs  T6J.m  and  Tos  m  can  be  constructed  based  on  G  in  the  same 
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Figure  3.1.  (a)  Nonzero  structure  of  A  :  (b)  the  corresponding  gnpb  C . 


manner  as  described  m  Chapter  2. 


3J  The  Parallelism  Ratio 

A  comparison  of  the  times  required  for  the  parallel  Gauss-Jacobi  and  parallel 
Gauss-Seidel  methods  must  address  their  relative  degrees  of  parallelism.  The  parallel¬ 
ism  ratio  specified  in  the  following  definition  will  be  used  in  comparing  the  asymptotic 
convergence  behavior  of  the  methods  as  the  number  of  iterations  goes  to  infinity. 

Definition  3.1.  Let  the  parallelism  ratio  r  be  defined  by  the  equation 

D(TaSm) 

r  •  lim - .  (3.3) 

m  —  D(rcj  m  ) 

'■*» 

The  existence  of  the  limit  and  the  relationship  of  r  to  the  structure  of  A  will  be  treated 
in  a  subsequent  section  of  this  chapter.  Equation  (3.3)  implies  that  m  iterations  of 
Gauss-Seidel  require  r  times  as  many  processor  steps  as  m  iterations  of  Gauss-Jacobi.  in 
the  limit  as  the  iteration  count  goes  to  infinity.  The  ratio  r  is  directly  related  to  the 
number  of  Gauss-Seidel  and  Gauss-Jacobi  iterations  that  can  be  performed  in  a  given 
number  of  processor  steps,  and  this  relationship  is  given  in  Lemma  3.4  following  a 
definition  and  several  other  lemmas  which  esut  ish  some  bask  properties  of  the  task 
graph  depths.  First,  functions  are  defined  whkh  represent  the  maximum  number  of 
iterations  that  can  be  performed  in  a  given  number  of  processor  steps. 

Definition  3.2.  Define  the  functions  mC5,mc/: such  that  mGS  (l )  is  the  largest 
integer  satisfying  D{TGym  ,  and  rr^jil )  is  the  largest  integer  satisfying 

D^TOJ.majU)^1  ■ 

In  the  degenerate  case  in  whkh  G  contains  no  cycles,  the  exact  solution  is  obtained 
in  a  finite  number  of  iterations,  as  reflected  in  the  following  lemma. 


Lrauu  3.1.  If  G  does  not  contain  a  cycle,  and  Si  &0  is  the  length  of  a  longest  path  in 
C  .then 

D  (Tqj.  m  )»D  (,Tas.  m  +1.  for  all  m  >Si  +1.  (3.4) 

If  Af  »0.  G  contains  no  arcs,  and  the  task  graphs  contain  no  arcs.  Therefore,  the  task 
graph  depths  are  1.  If  G  contains  a  longest  path  P  of  length  St  >0.  then  there  exist 
corresponding  paths  PGJ  in  T0J  m  and  Pes  in  Tos  m  both  of  length  St .  If  either  rc/  oo  or 
T 05.  m  contains  a  longer  path,  then  there  is  a  corresponding  path  in  G  which  contradicts 
the  fact  that  P  is  a  longest  path.  Therefore.  D  (TG}ma)xD  (TGS  aB)=Si  +1.  Since  adja¬ 
cent  vertices  pf  a  task  graph  are  always  in  the  same  or  consecutive  iterations,  any  task 
graph  for  at  least  Si  41  iterations  will  conuin  the  longest  path  of  length  Si ,  and  have 
depth  Si  +1. 

In  the  more  interesting  case  in  which  G  contains  a  cycle,  the  Gauss- Jacobi  itera¬ 
tions  proceed  at  a  rate  of  exactly  one  iteration  per  processor  step. 

Lemma  3.2.  If  G  contains  a  cycle,  then  D  (Taj  m  )*m  ,  and  mc/  (Z  )*/ ,  for  allm.l  €Z.. 

Due  to  the  construction  of  Toj.m  .  the  vertices  in  any  path  must  have  consecutive  itera¬ 
tion  numbers.  Therefore.  D(TGJm  )<m .  Since  G  contains  a  cycle,  it  contains  a  walk 
of  length  m  —  1.  and  such  a  walk  corresponds  to  a  path  of  length  m  —1  in  Tor.-  There¬ 
fore.  D(T6jm)^m.  Hence.  D  (TGJ  m  )*m .  The  relationship  )-l  follows 

immediately,  using  the  definition  of  mc/ . 

The  Gauss-Seidel  task  graph  depth  is  more  difficult  to  characterize.  In  fact  the 
expression  &(Tes  m+t)~D(TGS  m  )  is  not  necessarily  constant  as  a  function  of  m  .  Some 
useful,  but  relatively  weak,  properties  of  the  Gauss-Seidel  task  graph  depths  are  given 
in  the  following  lemma. 


Lemma  33.  If  C  contains  a  cycle,  then  the  sequences  D(TGS  m  )  and  m^g  (/ )  are 
unbounded  and  monotone  nondecreasing  in  m  and  l ,  respectively. 


Since  G  contains  a  cycle,  it  contains  a  walk  of  infinite  length  which  maps  to  a  path  of 
infinite  length  in  Tos.m-  Consequently,  the  task  graph  depth  goes  to  infinity.  The  addi¬ 
tion  of  one  more  iteration  to  any  task  graph  of  finite  iterations  does  not  remove  any  arcs 
or  vertices  from  the  original  graph  and  therefore  does  not  reduce  the  depth.  Hence,  the 
depth  is  monotone  nondecreasing.  As  a  result.  mos  (l )  must  also  be  unbounded  and 
monotone  nondecreasing. 

The  ratio  r  can  now  be  related  to  the  number  of  iterations  which  can  be  per¬ 
formed  in  a  given  number  of  processor  steps.  The  following  lemma  indicates  that  the 
Gauss- Jacobi  method  can  perform  r  times  as  many  iterations  as  Gauss-Seidel.  in  a  given 
number  of  processor  steps,  in  the  limit  as  the  number  of  processor  steps  goes  to  infinity, 
for  the  nondegenerate  case  where  G  contains  a  cycle. 


Lemma  3A  If  G  contains  a  cycle,  then 


me7  (Z) 

lim - a  r.  (3.5) 

I  -eo  niQg  (Z ) 


Due  to  Lemma  3.2  and  Def.  3.1.  for  any  e>0  there  exists  M  €Z  such  that 


which  implies 


—  r 


m 


<  e,  for  all  m  >M  , 


(3.6) 


D  (Tcs  m )  *  [8(m  )+r  ]m  .  for  all  m> M.  (3.7) 

where  |$(m  ))<€.  Since  (Z )  is  unbounded  and  nondecreasing,  there  exists  L  €Z  such 

that  mjjjfZ  )>M ,  for  all  Z  >L .  The  definition  of  mQS  and  (3.7)  imply  that  for  Z  >L  , 

mGS  (Z )  is  given  by  the  largest  integer  m  satisfying 
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[5(m  )+r  Jm  <  1.  (3.8) 

Consequently,  for  any  c>0  there  exists  L  such  that 

— - 7— —  .  for  all  /  >  L .  (3.9) 

SCn^g  (l  ))+r 

where  (l  )))<«.  Dividing  the  equality  mG/  Cl  )*I  by  (3.9)  yields 

mGJCO  l 

- s - .  for  all  l  >  L. 

nGS(l)  l  (3.10) 

SCmos  0  ))+r 

Since  8(mGS  (Z ))  can  be  made  arbitrarily  small  by  making  L  large.  (3.5)  must  hold,  and 
the  proof  of  the  lemma  is  complete. 

3.4  Parallel  Convergence  Speed 

If  m  iterations  of  Gauss-Seidel  can  be  completed  in  l  processor  steps,  then  rm 
iterations  of  Gauss-Jacobi  can  be  completed  in  the  same  number  of  processor  steps,  in 
the  limit  as  l  goes  to  infinity.  In  the  limit,  the  error  is  multiplied  by  a  factor  of 
f>CM6S  )m  in  m  iterations  of  Gauss-Seidel.  and  it  is  multiplied  by  a  factor  of  pCMc/  )rm 
by  Gauss-Jacobi  in  an  equal  number  of  processor  steps.  Therefore.  Gauss-Jacobi  will  be 
asymptotically  faster  than  Gauss-Seidel  if  pCMajY  ^pCM0S).  In  the  following 
theorem,  which  is  the  main  result  of  this  chapter,  it  will  be  shown  that  this  relationship 
between  the  spectral  radii  holds  for  the  class  of  matrices  to  which  the  Stein-Rosenberg 
theorem  applies.  Therefore,  parallel  Gauss-Jacobi  is  asymptotically  faster  than  parallel 
Gauss-Seidel  for  these  matrices  [Sma88a]. 

Theorem  3.2.  If  MCJ  is  nonnegative  and  pCMGJ )  <  1 ,  then  pCMCJ  )T  ^pCMcs ). 

If  p(A#G/)*0  then  pCMas  )=0.  and  the  theorem  is  trivially  satisfied  [Var62].  For  the 
case  where  pCMGJ  )>0.  G  must  contain  a  cycle  because  the  iterations  do  not  converge  in 
a  finite  number  of  iterations.  For  this  case,  the  proof  of  Theorem  3.2  utilizes  the 


v,  * 

V  ■  • 
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following  lemma  [Gan59]  and  definition: 


Lemma  3.5.  Perron-Frobenius:  If  matrix  M  €  R*  is  nonnegative,  then  it  has  a  nonne¬ 
gative  real  eigenvalue  equal  to  its  spectral  radius  and  a  nonnegative  eigenvector  associ¬ 


ated  with  that  eigenvalue. 


Definition  33.  Let  y(,\  z(i)€R*  contain  the  most  recently  computed  iterate  values  for 
each  vector  element  after  l  processor  steps  of  the  parallel  Gauss-Jacobi  and  Gauss-Seidel 
algorithms  respectively. 


i  ince  a  Gauss-Jacobi  iteration  finishes  in  one  processor  step,  y )  is  equal  to  the  llh 
Gauss-Jacobi  iterate,  and  therefore 


b-  n  a-  i 

—  +  i  - 

a-  a- 

M  J  mi  J  ml 


(3.11) 


In  general.  z' '  never  corresponds  to  any  iterate  of  Gauss-Seidel.  because  overlapping  of 


the  iterations  implies  that  the  most  recently  computed  element  values  can  correspond  to 
different  iterations.  And  because  of  data  dependencies  inherent  in  the  Gauss-Seidel 
algorithm,  many  of  the  elements  of  z(/)  are  the  same  as  those  of  z(,_1).  Therefore,  each 
Gauss-Seidel  update  will  be  given  by  either 


(3.12a) 


2/°=— +  I 


(3.12b) 


Q  ■  C- 

«.«  jmijm  i  .i 


where  m  (/,*./  ) 6  {0,  •••,/—  1 ).  Note  that  m  (l,  i,  j  )  is  used  to  indicate  that  in  order  to 


follow  the  Gauss-Seidel  update  formula,  it  may  be  necessary  to  pick  out  elements  from 
several  different,  but  earlier,  z  vectors. 

To  complete  the  proof,  consider  the  problem  of  solving  Ax  =0  by  relaxation, 
where  A  is  such  that  Ma]  exists  and  is  nonnegative  and  0<piMCj)<l.  Let  the  initial 


V-V* 


•  *>**•' 
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guess  x(0)-y<0)^z<0)  be  a  nonnegative  eigenvector  associated  with  a  nonnegative  eigen¬ 
value  of  MGj  equal  to  p(Afe;).  Since  MG}  is  nonnegative.  at  J  /ait  <0  for  all  i  &j. 
Consequently,  since  x<0)  is  nonnegative  and  b  =0,  each  term  in  (3.11)  and  (3.12b)  is 
nonnegative  for  all  i,l .  Also,  since  y<0)  is  an  eigenvector  associated  with  an  eigenvalue 
equal  to  p(MGj  )  <  1 .  y  U)=p(AfG;  )( y  <0\  and  therefore  y,(,)  decays  monotonically  with  l 
for  all  i . 


It  will  be  shown  by  induction  that 

y/ (n<ri<m).  for  alii,  for  all  m  €{0.  •  ,l\.  (3.13) 

holds  for  all  l.  Clearly  (3.13)  holds  for  l  =0,  forming  the  basis  of  the  induction. 

Assume  that  (3.13)  holds  for  a  given  l .  and  consider  processor  step  l  +1.  In  those  cases 

where  (3.12a)  applies,  y/1  and  y/i+1)^z1(,+1)  because  y,(l)  is  monotone 

decreasing  in  l .  In  those  cases  where  (3.12b)  applies,  each  term  of  the  summation  in 

(3.11)  is  less  than  or  equal  to  the  corresponding  term  in  (3.12b)  and  all  the  terms  are 

nonnegative.  Hence  Since  yf(/)  decreases  monotonically, 

yi{l+1)^yi(n)^zi{m)  for  all  m^l.  Consequently.  y/,+1)<zj(m>  for  all  i  and  all 

m  €{0.  •  •  •  .  1 41}.  thus  completing  the  induction. 


<"•««  » 


In  /  processor  steps.  mGS  ( l )  iterations  of  Gauss-Seidel  are  completed.  Let  x 
be  the  Gauss-Seidel  iterate  on  iteration  mGS  (l ).  Then  for  any  i.l .  there  exists  m 


U ))  (m  \  //\ 

such  that  xi  .  Applying  (3.13)  for  each  i .  it  follows  that  y 


where  y(/)  is  equal  to  the  Ith  Gauss-Jacobi  iterate.  As  l  -*oo,  Lemma  3.4  indicates  that 


r  times  as  many  Gauss-Jacobi  iterations  are  completed  as  Gauss-Seidel  iterations.  In 


order  for  x  to  remain  larger  than  y^‘\  MGS  must  have  an  eigenvalue  with  mag¬ 

nitude  larger  than  p(M0/  Y  .  the  factor  by  which  the  Gauss-Jacobi  error  is  reduced  in  r 
processor  steps.  Therefore.  piMCJ  Y  ^p(Mcs ).  and  the  proof  is  complete. 


\ !  v* \ , V .V.V, V/lV** * V . V iVl v*f 


msssam 


46 


3.5  Parallelism  Ratio  Bounds 

It  is  interesting  to  note  that  the  proof  of  Theorem  3.2  does  not  require  explicit 
knowledge  of  the  value  of  r  .  but  rather  only  assumes  that  r  exists.  The  value  of  r  is 
of  interest  because  it  represents  the  degree  of  parallelism  of  Gauss-Jacobi  compared  to 
Gauss-Seidel.  On  the  average,  parallel  Gauss-Jacobi  requires  r  times  as  many  proces¬ 
sors  as  parallel  Gauss-Seidel  if  the  full  parallelism  of  the  methods  is  employed.  In  this 
section,  several  bounds  on  r  will  be  presented,  assuming  that  r  exists.  In  the  next  sec¬ 
tion.  the  existence  of  r  will  be  established,  and  an  exact  formula  for  r  will  be  given  in 
terms  of  properties  of  G  . 

In  the  usual  case  where  G  contains  a  cycle,  and  m  Gauss-Jacobi  iterations  require 
exactly  m  processor  steps,  r  is  equivalent  to  the  average  number  of  steps  between  the 
completion  of  successive  Gauss-Seidel  iterations.  If  no  parallelism  is  employed  in 
Gauss-Seidel.  n  extra  steps  are  required  for  each  extra  iteration.  If  parallelism  is  util¬ 
ized.  the  number  of  steps  per  iteration  will  be  no  greater  than  n  ,  and  therefore 

r  <n.  (3.14) 

This  bound  will  be  reached  if  A  is  full,  or  if  G  contains  a  cycle  from  vertex  1  to  2  to  ... 

to  n  to  1. 

If  concurrent  updates  are  allowed  within  a  Gauss-Seidel  iteration,  then  D  (Tcs  2) 
steps  will  be  required  for  the  first  iteration.  And  if  parallelism  is  not  exploited  between 
different  iterations,  each  additional  iteration  will  require  an  additional  D  ( Tos  j)  steps. 
Since  some,  but  not  all.  of  the  potential  parallelism  is  utilized,  it  follows  that 

r  <  D(Tgs  ,)  ^  *.  (3.15) 

This  tighter  bound  on  r  will  be  approached  when  nearly  all  of  the  Gauss-Seidel  paral¬ 
lelism  is  due  to  intra-iteration  parallelism. 


47 


In  some  cases,  most  or  all  of  the  available  Gauss-Seidel  parallelism  arises  from  the 
overlapping  of  iterations.  For  example,  if  A  is  tridiagonal  then  D  (Tos  t)  is  equal  to  n  . 
since  each  update  requires  the  result  of  the  previous  update  of  the  same  iteration.  How¬ 
ever.  iteration  £  +1  can  begin  after  only  2  steps  of  iteration  k .  Because  of  the  inter¬ 
iteration  parallelism,  r  =2  for  a  tridiagonal  matrix,  regardless  of  the  value  of  n .  In 
cases  like  this,  the  bounds  in  (3.15)  do  not  give  an  accurate  indication  of  the  high  degree 
of  Gauss-Seidel  parallelism.  An  exact  determination  of  the  value  of  r  requires  that 
both  inter-iteration  and  intra-iteration  parallelism  be  taken  into  account. 


3.6  Parallelism  Ratio  Formula 

In  this  section,  the  limit  in  (3.3)  which  defines  r  will  be  shown  to  exist,  and  a  for¬ 
mula  will  be  derived  for  r  in  terms  of  properties  of  G  .  In  the  following  development. 
L  (F )  will  denote  the  length  of  path  P .  and  D(v)  will  denote  the  depth  of  vertex  v . 
defined  as  the  number  of  vertices  in  a  longest  path  terminating  at  v .  The  notation 
D  (T  ).  where  T  is  a  directed  graph,  will  denote  the  depth  of  the  graph  as  defined  in 
Chapter  2.  The  main  result  of  this  section  is  given  in  Theorem  3.3. 


Theorem  33.  The  limit  in  (3.3)  exists,  and 

1 +/  (C  )/h(C  )  .  if  G  contains  a  cycle 
r  *  „ 

1  ,  otherwise 

where  C  is  a  cycle  of  G  which  maximizes  f  (C  )/b  (C  ). 


(3.16) 


Note  that  any  cycle  of  G  must  contain  a  feedback  arc.  so  b(C)  will  be  nonzero  in 
(3.16)  when  G  contains  a  cycle. 

If  G  contains  no  cycles,  then  D{TOJ  m  )=D  (TOS  m  )  for  m  sufficiently  large,  due  to 
Lemma  3.1.  Therefore,  r  =1  in  this  case. 


If  G  contains  a  cycle,  then  both  the  Gauss-Seidel  and  Gauss-Jacobi  task  graph 
depths  go  to  infinity,  and  D  (TGJ  m  )»m .  To  complete  the  proof,  it  must  be  shown  that 
limU>(r05  m)/m  ]*1+/  (C )/6 (C ).  when  G  contains  a  cycle.  This  will  be  done  in  the 
next  two  lemmas  by  establishing  upper  and  lower  bounds  on  the  limit. 

Lemma  3 16.  If  G  contains  a  cycle,  then  far  any  €>0  there  exists  M  EZ  depending  on  € 
such  that 

D ( Tas  m)  t  (r  ) 

- : —  >  1  + - e.  for  all  m  >Af  .  (3.17) 

m  6(C) 

where  C  is  a  cycle  of  C  which  maximizes  f  (C  )/b  (C  ). 


Let  v  be  a  vertex  of  C .  For  any  j  >  1.  let  m  *jb  (C  )+l.  Let  v ,  and  vM  be  the  instances 
of  v  in  iterations  1  and  m  of  ^OS.m  •  respectively.  There  is  a  path  P  from  v,  to  vm 
corresponding  to  j  traversals  of  C .  because  the  iteration  number  increases  by  one  every 
time  a  feedback  arc  is  traversed.  Combining  the  relations  D{TCSm  )^D(vm ). 
Z)(vm)>/)(v1)+L(/»).  D(Vl)>l.  LCP)«;[/(C)+6(C)J.  and  /»(m-l)/6(C)  pro¬ 
duces 

D(TCS  m  >  >  1+-— -  (/  (C  )+6(C  )].  for  m  €{l.  6(C  )+l.  2b  (C  )+l.  •  }.(3.18) 

6(C) 

Bounds  on  the  task  graph  depth  for  intermediate  values  of  m  can  be  obtained  using  the 
fact  that  D(Tas  m)  is  monotone  nondecreasing  in  m.  The  following  lower  bound  on 
D  (Tgs  m  )  applies  for  all  m : 

m  —6  (C  ) 

D(TOSm)  >  1+ - (/  (CHMC)l.  (3.19) 

6(C) 

Dividing  by  m  gives 

|/(C)+6(C)|.  (3.20) 

Discarding  the  1/m  term  and  distributing  yields 


D(rCSm)  ! 

- £  _+ 

m  m 


6(C) 


1 

m 
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D(T0Sm)  f(C )  /(CHMC) 

- >  1  +  - - - - . 

m  b(C )  m 

Note  that  /  (C )+b  (C  )<«.«*  C  contains  at  moat  n  — 1  arcs.  Therefore. 


(3.21) 


D(TGSm)  /(c)  n 

- >  1  + - . 

m  6  (C )  m 

By  setting  3#  »n  /«.  (3.17)  is  obtained,  thus  completing  the  proof  of  Lemma  3.6. 


(3.22) 


Lemma  3.7.  If  G  contains  a  cycle,  than  for  any  e>0  there  exists  At  €Z  depending  on  e 
such  that 


WaSm)  /(C) 

- <  1  + - +  «.  for  all  m  >At . 

m  b(C) 

whore  C  is  a  cycle  ofG  which  maximizes  f  (C  )/b  (C  ). 


(3.23) 


Let  u  be  any  vertex  of  G  and  let  um  be  the  instance  of  u  in  iteration  m  of  TOS  m .  Let 
P  be  any  longest  path,  consisting  of  one  or  mote  vertices,  terminating  at  um .  In  order 

to  establish  a  bound  on  the  length  of  P.  partition  P  into  subpaths  Pq.  Qv  P2-  Qt . 

Pt. j.  Q, .  P, .  such  that  the  following  conditions  are  satisfied: 

(a)  The  concatenation  of  the  arcs  of  P0.  Q -  -  .  Pt  is  equal  to  the  sequence  of  arcs 

in  P . 

(b)  Pj  contains  no  two  instances  of  a  common  vertex  of  G  .  for  each  j . 

(c)  The  endpoints  of  Q}  are  instances  of  a  common  vertex  of  G  .  for  each  j . 

(d)  The  origins  of  Q,  and  Qt  are  instances  of  distinct  vertices  of  G  .  for  each  i  tPj . 

The  possibilities  that  s  might  be  0  and  that  P0  or  P ,  might  be  null  are  not  ruled  out. 
Note  that  conditions  (a),  (b).  and  (c)  are  easily  satisfied  by  putting  subpaths  with  more 
than  one  instance  of  the  same  vertex  into  Q  subpaths.  If  a  partitioning  does  not  satisfy 
condition  (d).  it  can  be  modified  by  combining  Q,  .  Pt . QJ  into  a  single  Q,  subpath. 


so 

Since  P  is  a  longest  path  terminating  at  um .  D  (ua  )mL  {P  Hi.  Condition  (a)  that 
implies 

jm  t 

>  (3.24) 

J  "0  J»l 

Baaed  on  this  equation,  an  upper  bound  on  D  (um  )  will  be  derived  which  will  lead  to 
the  result  in  (3.23).  Condition  (b)  implies  that  L  {P}  )<n  -1.  and  condition  (d)  implies 
s  4 \n .  Applying  these  relations  to  (3.24).  plus  the  fact  that  L  (Qy  )■/  (Qy  )+b  (Q; ). 
yields 

S 

/>(«„,)<  l+(n  +lXn  -1H  £  [/  (Qj  )+*  ( Qj )].  (3.25) 

i*  i 

Recall  that  the  first  and  last  vertices  of  Qj  are  instances  of  a  single  vertex  x>  of 
C  .  To  obtain  a  bound  on  /  (Qj  ).  Qt  is  further  partitioned  into  one  or  more  subpaths. 
Qj  j . Q)  t .  such  that  each  subpath  starts  and  ends  on  an  instance  of  xf .  and  no  inter¬ 

nal  vertex  is  an  instance  of  Xj .  Each  Qj  t  corresponds  to  a  cycle  Ct  t  of  G .  Because  of 
the  manner  in  which  cycle  C  is  chosen.  /  (C^  ()/KC^  ()4/  (C  )fb (C ).  Using  this 
relationship  and  summing  over  all  subpaths  of  Qj .  the  bound 
/  (fi>  )^MQ;  )/  (C  )/b  (C  )  is  obtained.  Substituting  this  bound  into  (3.25)  produces 


Z>(um)  <  nJ+ 


1+ 


/(C) 

MC) 


£MC;). 
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(3.26) 


Since  (3.26)  applies  for  any  choice  of  vertex  um  in  iteration  m .  D  (um  )  may  be  replaced 
by  D(TCS  m  ).  Applying  the  relationship  £h (Q/  )^6(/,)<m  and  dividing  by  m  yields 
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DUOS*,)  nz  /(C) 

-  <  —  +  1  + - 

m  m  6  (C ) 


(3.27) 


Choosing  M*n  J/*  confirms  the  validity  of  (3.23)  and  completes  the  proof  of  Lemma 


3.7.  Lemmas  3.6  and  3.7  lead  directly  to  the  result  stated  in  Theorem  3.3  for  the  case 
where  G  contains  a  cycle,  thus  completing  the  proof  of  the  theorem. 


CHAPTER  4 


SPEEDUP  ESTIMATES 


The  objective  of  parallel  processing  is  to  reduce  the  overall  run  time  of  a  program 
by  executing  different  parts  iff  it  on  different  processors  concurrently.  A  convenient 
measure  of  the  success  of  parallel  processing  in  a  given  situation  is  the  speedup,  which 
indicates  how  much  faster  the  program  runs  on  a  specified  number  of  processors  com¬ 
pared  to  the  run  time  on  a  single  processor.  In  this  chapter,  several  techniques  are 
investigated  for  estimating  the  speedup  of  parallel  waveform  relaxation  algorithms  for 
a  set  of  benchmark  circuits.  The  simplifying  assumptions  introduced  in  Chapter  2  are 
used  as  a  starting  point  for  computing  simple  estimates.  The  assumptions  are  then 
replaced  by  more  realistic  assumptions  producing  more  accurate  estimates.  In  the  pro¬ 
cess  of  refining  the  assumptions,  insights  are  gained  into  the  extent  to  which  different 
factors  affect  the  speedup.  In  subsequent  chapters,  the  speedup  estimates  are  used  for 
two  purposes.  Fast  estimates  are  used  to  select  the  fastest  parallel  waveform  relaxation 
algorithm  prior  to  performing  a  circuit  simulation.  Accurate  estimates  excluding  mul¬ 
tiprocessing  overhead  factors  are  compared  with  measured  speedups  to  determine  the 
extent  to  which  overhead  affects  the  performance  of  the  algorithms.  Finally,  in  this 
chapter,  accurate  estimates  neglecting  overhead  are  used  to  predict  the  potential  perfor¬ 
mance  of  the  algorithms  when  the  number  of  processors  is  large. 


The  parallel  processing  speedup  achieved  by  an  algorithm  X .  applied  to  a  given 


problem,  running  on  k  processors,  is  defined  as 
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Sk  «  —  .  (4.1) 

where  1 4  m  the  run  lime  of  algorithm  X  on  k  procemors  and  r,  ki  reference  time 
which  in  some  aenee  reflects  the  time  required  to  solve  the  same  problem  using  only  1 
processor.  The  speedup  is  a  measure  of  how  much  faster  the  program  runs  on  k  proces¬ 
sors  compered  to  the  time  required  to  obtain  the  same  solution  on  1  processor. 

One  obvious  choice  for  the  reference  time  is  the  time  required  to  run  the  same 
algorithm.  X .  on  1  processor.  Speedups  computed  in  this  manner  will  be  referred  to  as 
unnormalized  speedups.  denoted  as  5^  t .  Unnormalised  speedups  satisfy  the  property 
Sy  t  Ki  .  and  Sy  t  will  be  dose  to  *  if  the  computations  are  nearly  evenly  distributed 
between  all  the  processors  throughout  the  execution  of  the  program  and  if  the  parallel 
processing  overhead  is  small.  The  ratio  Sy  k/k  is  a  measure  of  the  processor  utilization, 
or  the  fraction  of  time,  on  the  average,  that  the  processors  are  busy,  not  counting  the 
time  spent  doing  extra  work  which  is  performed  in  the  *  -processor  case  but  not  in  the 
uniprocessor  case. 

When  using  speedups  to  compare  the  overall  performance  of  different  algorithms 
for  solving  the  same  problem,  a  common  reference  time  must  be  used  for  all  the  algo¬ 
rithms  so  that  a  greater  speedup  indicates  an  algorithm  with  a  shorter  run  time.  The 
normalized  speedup  of  algorithm  X  on  *  processors  with  respect  to  algorithm  Y  is 
defined  as 

Tr . i  Tr . l 

Sir  k  *  '  m$u,t  •  (4.2) 

Tx.t  Tx.t 

where  rz.  m  is  the  run  time  of  algorithm  Z  on  m  processors,  for  any  Z  and  m  .  A  good 
choice  for  the  common  reference  us  the  time  required  by  the  fastest  available  uniproces¬ 
sor  algorithm.  Thai  Sir  i  KSV  t  Kk  .  Note  that  Sy  *  /k  is  not  a  measure  of  the  utiliza¬ 
tion  of  the  processors  as  is  Sg,  t  /k  . 
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Of  the  waveform  relaxation  algorithms  under  consideration,  the  Gauss-Seidel 
method  consistently  runs  faster  than  Gauss- Jacobi  on  a  uniprocessor.  And  a  uniproces¬ 
sor  waveform  relaxation  program  which  does  not  have  the  extra  overhead  of  the  full 
window  technique  or  time  point  pipelining  will  run  faster  than  a  program  with  the 
extra  parallel  processing  code.  Therefore,  a  uniprocessor  program  using  Gaum-Seidel  is 
s  useful  reference  for  computing  normalized  speedups. 

4.2  Benchmark  Circuits 

The  performance  of  the  parallel  algorithms  is  a  strong  function  of  the  circuit  being 
simulated  and  the  subcircuit  partitioning,  since  the  subcircuit  interconnection  structure 
determines  the  structure  of  the  task  graphs.  Five  benchmark  circuits,  including  CMOS 
and  NMOS  designs,  are  used  to  compare  the  performance  of  the  parallel  algorithms. 
The  circuit  characteristics  are  summarized  in  Table  4.1.  All  coupling  is  bidirectional,  ao 
the  results  from  Chapter  2  concerning  such  circuits  are  applicable. 

Circuits  dvr  and  btn2k  are  portions  of  industrial  designs  which  were  extracted 
from  chip  layouts.  The  inclusion  of  nonzero  interconnect  resistance  on  the  power  and 
ground  busses  is  responsible  for  the  highly  nonuniform  subcircuit  sizes  in  dvs.  since 
there  are  a  large  number  of  tightly  coupled  bus  nodes  which  ere  placed  in  e  single 

Table  4.1.  Circuit  Characteristics 


FETs 

Nodes 

Subcircuits 

Nodes  per  Subcircuit 
min  max  mean  <r 

54 

189 

27 

1 

30 

7.0 

6.5 

116 

56 

30 

1 

9 

1.9 

1.6 

416 

150 

90 

mm 

MM 

1.7 

1.5 

805 

388 

119 

U 

SB 

3.3 

6.2 

698 

378 

222 

MM 

wm 

1.7 

1.2 

hn 
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•ubcircuil.  The  b*n2k  circuit  is  a  portion  of  a  static  RAM.  and  the  transistor  connec¬ 
tions  rather  than  parasitic  resistors  account  for  the  large  subcircuits  in  this  case.  The 
dpia  circuit  contains  only  a  few  interconnect  resistors,  and  the  sedae  and  digJL  circuits 
contain  none.  These  latter  circuits  have  a  comparatively  high  degree  of  uniformity  of 
subcircuit  sizes,  as  demonstrated  by  the  low  maximum  numbers  of  nodes  per  subcircuit 
and  by  the  small  standard  deviations. 


43  Presimulation  Estimates 

Two  categories  of  speedup  estimates  will  be  addressed:  presimulation  and  post¬ 
simulation  estimates.  The  presimulation  estimates  can  be  computed  without  perform¬ 
ing  a  simulation  of  the  circuit.  They  provide  insights  into  the  nature  of  parallelism  in 
waveform  relaxation,  and  they  can  serve  as  the  basis  for  selecting  the  algorithm  to  be 
used  for  a  given  circuit  on  a  given  number  of  processors  prior  to  performing  the  circuit 
simulation. 


4J.1  Type  1  estimates:  uniform  task  times 

Simplifying  assumptions  were  introduced  in  Chapter  2.  and  these  assumptions  are 
applied  to  the  benchmark  circuits  to  produce  the  Type  1  speedup  estimates  for  the  full 
window  technique.  The  assumptions  are  that  each  task  requires  the  same  amount  of 
time,  and  an  unlimited  supply  of  processors  is  available.  Lnder  these  conditions. 

(4.3) 

where  X  is  either  GS  or  GJ.  depending  on  the  algorithm  used.  The  Type  1  estimates  for 
the  benchmark  circuits  are  given  in  Table  4.2  for  different  numbers  of  iterations,  since 
the  number  of  iterations  required  for  convergence  is  not  known  prior  to  simulation. 
The  number  of  iterations  is  typically  around  4.  The  speedups  are  unnormalized, 
because  the  number  of  iterations  and  the  window  boundaries  chosen  by  the  automatic 
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Table  4.2.  Type  1  Speedup  Estimates 


Circuit 

Gaui 

1 

m-Seidel  Iterations 

2  4  8 

Gaum- Jacobi  Iterations 
1.  2.  4.  or  8 

dvs 

2.3 

2.3 

2.4 

2.4 

27 

dpla 

2.3 

2.9 

3.2 

3J 

30 

scdac 

10.0 

10.0 

12.0 

13.0 

90 

ben2k 

3.8 

4.7 

5.2 

5.6 

119 

digfi 

6.3 

7.2 

7.7 

7.9 

222 

windowing  algorithm  will  differ  by  unknown  amounts  for  the  two  relaxation  methods, 
and  these  factors  determine  the  ratio  of  uniprocessor  run  times. 

The  results  show  that  the  Gauss-Seidel  speedups  are  severely  limited  by  the  struc¬ 
ture  of  the  task  graphs.  Since  the  Gauss- Jacobi  speedups  are  10  to  20  times  greater  than 
the  Gauss-Seidel  speedups.  one  would  expect  Gains- Jacobi  to  outperform  Gauss-Seidel 
when  the  number  of  processors  is  large,  even  if  Gauss-Jacobi  requires  considerably  more 
iterations.  It  is  also  interesting  to  note  that  the  Gauss-Seidel  speedups  are  only  weakly 
dependent  on  the  number  of  iterations,  suggesting  that  most  of  the  parallelism  occurs 
between  tasks  of  an  iteration  rather  than  between  tasks  of  different  iterations.  The 
Gauss-Jacobi  speedups  are  equal  to  the  number  of  subcircuits,  and  are  independent  of 
the  number  of  iterations.  Since  all  the  task  times  are  assumed  to  be  uniform  and  all  the 
circuits  are  bidirectionally  coupled.  Theorems  2.2  and  2.3  imply  that  the  Type  1 
speedup  estimates  of  the  benchmark  circuits  are  independent  of  whether  the  unaug¬ 
mented  or  the  augmented.  T .  task  graphs  are  used. 

43.2  Type  2  estimates:  non  uniform  task  times 

The  uniform  task  time  assumption  automatically  causes  the  tasks  to  be  executed 
one  level  at  a  time  in  the  levelized  task  graph,  Even  if  the  task  times  are  not  uniform, 
it  is  possible  to  enforce  a  rule  which  requires  all  tasks  in  one  level  to  finish  before  any 
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tasks  in  the  next  level  begin  executing.  This  is  a  relatively  simple  way  to  enforce  the 
precedence  constraints  but  does  not  take  advantage  of  all  the  available  parallelism, 
because  some  tasks  in  level  l  +1  may  have  all  their  precedence  constraints  satisfied 
before  all  tasks  in  level  l  finish. 

In  the  Type  2  speedup  estimates,  the  tasks  are  forced  to  be  executed  one  level  at  a 
time,  but  the  task  times  are  allowed  to  be  nonuniform.  In  this  scenario,  the  time 
required  by  level  l  is  determined  by  the  largest  task  in  the  level.  The  speedup  is 
estimated  as 

I  ", 

,  « 1*11  ta*ksl 

j><r)  ‘  (4.4) 

£  max  ) 

lml  teltvtWI 

where  T  is  the  task  graph  and  ws  is  an  estimate  of  the  computation  time  for  task  x . 
Note  that  nonuniform  task  times  within  a  level  will  invariably  have  a  detrimental 
effect  on  the  speedup  in  this  model.  However,  differences  in  task  sizes  between  different 
levels  may  result  in  larger  or  smaller  speedups.  If  one  level  contains  many  tasks  and 
another  contains  few  tasks,  then  the  overall  speedup  will  benefit  if  the  tasks  in  the  level 
with  many  tasks  all  have  long  computation  times  compared  to  the  tasks  in  the  level 
with  few  tasks.  On  the  other  hand,  if  the  computation  time  of  levels  with  few  tasks 
dominate,  then  the  speedup  will  suffer. 

An  estimate  is  needed  of  the  computation  times  of  the  tasks.  Since  a  task  consists 
of  the  evaluation  of  a  subcircuit  over  a  time  window,  and  since  subcircuits  are  typically 
small,  the  task  computation  time  is  dominated  by  the  evaluation  of  model  equations. 
The  time  to  evaluate  the  model  equations  depends  on  the  number  of  models  to  be 
evaluated,  their  complexity,  the  number  of  time  points  at  which  the  models  need  to  be 
evaluated,  and  the  operating  regions  of  the  circuit  elements.  The  number  of  time  points 


R 

s 

I 

I 

I 

i 

£ 

9m 

I 

s 

S 


Li 

\ 

&  I 


57 


and  the  operating  regions  are  difficult  to  estimate  without  running  a  circuit  simulation, 
and  therefore  these  factors  are  ignored  in  the  Type  2  estimates.  However,  the  number 
and  types  of  models  are  known  from  the  subcircuit  partitioning.  Since  the  number  of 
circuit  elements  in  a  subcircuit  is  typically  proportional  to  the  number  of  nodes,  the 
number  of  nodes  in  a  subcircuit  is  used  as  a  measure  of  the  computation  time  for  each 
task  which  is  an  instance  of  the  subcircuit.  Therefore.  wr  is  set  equal  to  the  number  of 
nodes  in  the  subcircuit  corresponding  to  task  x .  The  estimate  could  be  further  refined 
by  directly  considering  the  number  and  types  of  circuit  elements  in  each  subcircuit. 

The  Type  2  estimates  are  presented  for  the  benchmark  circuits  in  Table  4.3.  The 
nonuniformity  of  task  sizes  has  a  significant  effect  compared  to  the  results  of  the  Type 
1  estimates.  The  impact  is  especially  severe  in  the  Gauss-Jacobi  case,  where  the  speedup 
is  determined  by  the  fraction  of  circuit  nodes  which  appear  in  the  largest  subcircuit. 
For  example,  in  the  ben2k  circuit.  1/7.2  of  the  total  circuit  nodes  are  contained  in  the 
largest  subcircuit,  even  though  there  are  over  100  subcircuits,  and  this  limits  the 
speedup  to  7.2.  Type  2  estimates  are  also  insensitive  to  whether  or  not  the  T  aug¬ 
mented  task  graphs  are  used  in  the  case  of  bidirectionally  coupled  circuits. 


Table  4.3.  Type  2  Speedup  Estimates 


Circuit 

Gauss-Seidel  Iterations 
12  4  8 

Gauss-Jacobi  Iterations 
1,  2,  4,  or  8 

dvs 

1.5 

1.5 

1.5 

1.5 

6.3 

dpla 

1.6 

1.9 

2.1 

2.1 

6.2 

scdac 

3.4 

3.9 

4.1 

4.3 

21.4 

ben2k 

1.6 

1.8 

1.8 

1.9 

7.2 

digfi 

3.5 

3.8 

3.9 

4.0 

37.8 

4JJ  Type  3  estimates  unsynchronized  levels 

The  enforcement  of  a  synchronization  between  each  level,  as  assumed  in  the  Type 

2  estimate,  has  a  negative  impact  on  the  speedup.  To  determine  the  extent  of  this 
impact,  the  Type  3  speedup  estimates  retain  the  nonuniform  task  time  estimates  of 
Type  2.  while  removing  the  synchronization  of  levels.  In  the  Type  3  model,  each  task  is 
assumed  to  begin  executing  immediately  after  all  its  predecessors  in  the  task  graph  are 
finished.  If  P  is  a  path  in  the  task  graph  such  that  no  other  path  has  a  larger  sum  of 
estimated  task  times,  then  the  parallel  execution  time  is  determined  by  F .  and  the  Type 

3  estimate  is  given  by 
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Table  4.4  shows  the  results  of  this  estimation  procedure,  assuming  the  unaug¬ 
mented  task  graphs  are  used.  As  expected,  the  speedups  are  greater  than  those  predicted 
by  the  Type  2  estimates.  For  these  examples,  the  penalty  for  synchronizing  the  levels  is 
up  to  a  factor  of  2.  and  in  most  cases  is  considerably  less  than  2.  In  the  Gauss-Jacobi 
case,  the  speedup  for  an  even  number  of  iterations  is  determined  by  the  two  adjacent 
subcircuits  whose  node  sum  is  largest,  because  there  is  a  path  in  the  task  graph  which 


Table  4.4.  Type  3  Speedup  Estimates:  Unaugmented  Task  Graph 


Circuit 


Gauss-Seidel  Iterations 
12  4  8 

Gauss-Jacobi  Iterations 

1  2.  4.  or  8 

1.5 

1.5 

1.5 

1.5 

6.3 

7.7 

1.8 

2.0 

2.2 

2.3 

6.2 

10.2 

5.4 

7.0 

8.1 

8.8 

21.4 

25.0 

1.9 

2.3 

2.5 

2.6 

7.2 

9.0 

4.5 

5.0 

5.3 

5.4 

37.8 

54.0 

59 


alternates  between  these  2  subcircuits  on  alternate  iterations.  For  example,  in  the  ben2k 
circuit,  there  are  two  adjacent  subcircuits  whose  node  sum  is  2/9  of  the  total  circuit 
nodes,  resulting  in  a  speedup  of  9. 

Type  1  and  Type  2  estimates  are  independent  of  whether  the  unaugmented  task 
graphs  T  or  the  augmented  task  graphs  T  are  used,  because  the  levels  are  synchronized. 
In  the  Type  3  estimate,  the  impact  of  the  extra  constraints  of  the  augmented  graph  T 
can  be  observed  in  the  Gauss-Jacobi  results  of  Table  4.5.  In  this  case  the  Gauss-Jacobi 
completion  time  is  determined  entirely  by  the  largest  subcircuit,  since  the  task  graph 
contains  a  path  including  each  instance  of  the  largest  subcircuit.  Since  all  the  circuits 
are  bidirectionally  coupled,  the  extra  constraints  of  Tos  have  no  effect  compared  to  the 
unaugmented  task  graph,  as  proved  in  Chapter  2. 

4.3.4  Normalization  to  Gauss-Seidel 

The  unnormalized  speedup  estimates  considered  above  fail  to  demonstrate  whether 
Gauss-Seidel  or  Gauss-Jacobi  will  be  faster,  because  the  ratio  rGS  Jr0]  j  is  unknown. 
The  benchmark  circuits  were  simulated  using  both  the  Gauss-Seidel  and  Gauss-Jacobi 
methods  on  a  uniprocessor,  to  obtain  measurements  of  rcs  l  and  rGJ  j.  The  ratios  are 
presented  in  Table  4.6.  For  all  the  benchmark  circuits,  the  ratios  are  very  close  to  0.7. 


Table  4.6.  Ratios  of  Gauss-Seidel  to  Gauss- Jacobi  Uniprocessor  Run  Times 


Circuit 

Ratio 

dvs 

0.7 

dpla 

0.8 

scdac 

0.7 

ben2k 

0.6 

digfi 

0.7 

This  suggests  that  it  may  be  appropriate  to  use  the  constant  0.7  as  an  estimate  of 
tgs  j/tg;  j  in  presimulation  normalized  speedup  estimates.  Multiplying  the  Gauss- 
Jacobi  speedup  estimates  by  0.7  results  in  normalized  speedups  which  are  still  consider¬ 
ably  larger  than  the  Gauss-Seidel  speedups.  In  Chapter  5,  the  normalized  speedup  esti¬ 
mates  are  applied  to  the  problem  of  selecting  between  the  Gauss-Seidel  and  Gauss-Jacobi 
methods  prior  to  performing  a  circuit  simulation. 

43.5  Time  point  pipelining  estimates 

The  parallel  performance  of  time  point  pipelining  is  a  strong  function  of  the 
number  of  time  points  in  each  window  and  the  positions  of  the  time  points  on  the  time 
axis.  This  information  is  not  available  prior  to  performing  the  circuit  simulation. 
However,  for  the  purpose  of  predicting  which  of  the  4  parallel  waveform  relaxation 
methods  is  fastest  for  a  given  circuit  and  a  given  number  of  processors,  it  is  not  neces¬ 
sary  to  actually  predict  the  time  point  pipelining  speedup.  A  technique  for  selecting  the 
fastest  of  the  4  methods  prior  to  simulation,  based  on  the  speedup  estimates  for  the  full 
window  technique,  is  presented  in  Chapter  6. 

4.4  Post-simulation  Estimates 

Even  the  most  sophisticated  of  the  presimulation  estimates  suffers  from  the  fol¬ 


lowing  limitations: 


(1)  Differences  in  the  amount  of  signal  activity  in  different  subcircuits  affect  the 
task  execution  times,  but  are  not  considered  in  the  estimates. 

(2)  The  Gauss-Seidel/Gauss-Jacobi  normalization  factor  can  only  be  guessed  based 
on  previous  experience. 

(3)  Details  on  the  number  and  locations  of  time  points,  which  affect  time  point 
pipelining  performance,  are  unknown  prior  to  simulation. 

(4)  Multiprocessing  overhead  factors  such  as  task  scheduling,  data  communications, 
and  lime  spent  waiting  for  access  to  shared  resources  are  neglected. 

All  of  these  limitations  except  the  last  can  be  overcome  by  performing  a  simulation  of 
the  circuit  on  a  uniprocessor  prior  to  computing  the  estimate,  and  using  detailed  infor¬ 
mation  on  which  tasks  are  executed  and  their  individual  computation  times. 

Accurate  post-simulation  estimates  neglecting  overhead  have  two  applications. 
The  speedup  on  k  processors  can  be  estimated  even  if  a  multiprocessor  with  k  proces¬ 
sors  is  not  available.  This  allows  projections  to  be  made  of  the  potential  performance  of 
the  algorithms  on  machines  which  will  become  available  in  the  future.  The  other  appli¬ 
cation  is  that  of  determining  the  degree  to  which  multiprocessing  overhead  is  responsi¬ 
ble  for  the  performance  of  the  algorithms  as  measured  in  actual  multiprocessor  runs. 
For  example,  if  an  algorithm  has  a  speedup  of  only  2  on  8  processors,  then  either  the 
processors  are  idle  most  of  the  time  due  to  a  lack  of  work  which  can  be  done  con¬ 
currently.  or  the  multiprocessing  overhead  is  excessive.  If  the  post-simulation  estimate 
of  speedup  neglecting  overhead  is  7.  then  one  would  conclude  that  the  problem  is  one  of 
overhead  rather  than  a  lack  of  available  parallelism. 

Unlike  the  presimulation  estimation  techniques,  post-simulation  estimates  are  not 
suitable  to  serve  as  a  guide  in  selecting  the  fastest  algorithm  for  a  given  circuit  and  a 
given  number  of  processors  prior  to  simulating  the  circuit  on  a  multiprocessor.  Since  a 
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simulation  of  tbe  circuit  is  required  prior  to  computing  the  post-simulation  estimate, 
the  need  for  running  tbe  simulation  mi  a  multiprocessor  after  computing  the  estimate  is 
obviated. 

44.1  PARASITE  description 

The  PARASITE  program  was  developed  to  compute  accurate  post-simulation  esti¬ 
mates  of  the  parallel  execution  time  of  any  of  the  four  parallel  waveform  relaxation 
algorithms,  for  a  given  circuit  on  a  given  number  of  processors,  neglecting  multiprocess¬ 
ing  overhead.  The  organization  of  the  PARASITE  system  is  depicted  in  Fig.  4.1.  First  a 
circuit  simulation  is  performed  on  the  circuit  of  interest  using  a  uniprocessor  waveform 
relaxation  program  which  has  been  modified  to  produce  two  special  output  files  for  the 
PARASITE  program.  The  first  file  contains  the  subcircuit  graph,  which  PARASITE  uses 
to  construct  task  graphs.  The  second  file  contains  information  on  each  task  that  was 
executed.  For  the  full  window  technique,  this  file  contains  for  each  subcircuit  evalua¬ 
tion  task  (*.  t )  the  measured  computation  time  of  the  task.  r(k.  i ).  For  time  point 
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Figure  4.1.  PARASITE:  Parallel  simulation  time  estimator. 
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pipelining,  the  file  contains  a  record  of  each  time  point  subtask  (k.  i.  t ).  its  measured 
computation  time  rtk.i.t ).  and  the  corresponding  initial  location  of  the  time  point 
tmn  (k.i.t). 

After  the  circuit  simulation  is  complete,  the  PARASITE  program  is  run.  also  on  a 
uniprocessor.  The  PARASITE  program  mimics  the  operation  of  the  specified  parallel 
waveform  relaxation  algorithm,  but  instead  of  performing  computations  to  solve  the 
circuit  equations,  it  just  keeps  track  of  the  time  that  would  be  required  to  execute  the 
tasks  on  a  specified  number  of  processors.  The  windows  are  processed  sequentially. 
Within  each  window  a  weighted  task  graph  T  is  constructed  for  either  the  full  window 
technique  or  for  time  point  pipelining,  as  specified.  The  number  of  iterations  in  the  task 
graph  is  known  from  the  record  of  tasks  which  were  executed  in  the  circuit  simulation. 
All  the  other  information  needed  to  construct  the  task  graph  is  contained  in  the  subcir¬ 
cuit  graph  in  the  case  of  the  full  window  technique,  and  in  the  subcircuit  graph  and 
time  point  information  in  the  case  of  time  point  pipelining.  The  weights  are  obtained 
directly  from  the  measured  CPU  times  of  the  individual  tasks  or  subtasks. 

PARASITE  then  simulates  the  parallel  execution  of  the  tasks  in  the  graph  on 
Nfrxs  processors,  using  the  algorithm  given  below.  In  the  PARASITE  algorithm,  t  is 
the  estimated  elapsed  time,  f,  is  the  time  at  which  processor  i  will  finish  its  current 
task,  x,  is  the  task  assigned  to  processor  i .  w  (x  )  is  the  weight  of  task  x  .  P  is  the  set  of 
active  processors,  and  P  is  the  set  of  idle  processors.  A  queue  is  used  to  hold  tasks  that 
are  ready  to  execute  but  that  have  not  yet  been  assigned  to  processors  for  execution. 
The  function  qfHt  (tsk  )  puts  task  tsk  on  the  queue,  and  qempty  ()  returns  TRUE  if  the 
queue  is  empty.  The  function  qget  ()  returns  a  task  from  the  queue  and  deletes  it  from 
the  queue.  The  usk  obtained  by  qget  ()  is  the  task  of  the  lowest  available  iteration 
number  which  has  been  on  the  queue  for  the  longest  time.  The  initial  tasks  are  those 
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which  have  aa  in-degree  of  0  tit  the  task  graph. 

Algorithm  4.1.  PARASITE:  Simulate  Parallel  Execution  .  T  ) 

Huj*,*) 

*•*»•*»•  ‘ ' '  *~Q 

F*“ 

for  each  (initial  task  x  )  f/ur  (x  ) 
repeat  { 

/*  Assign  tasks  to  processors  */ 

for  each  (i  €?)  { 

if  (qcmpty  () •FALSE  )  { 

*i  -«f*  O 

tj  —t  +H>  (Xi  ) 

add  i  to  set  P 
remove  i  from  set  P 

\ 

) 

/*  Advance  time  V 
find  t  €P  with  smallest 

for  each  (successor  y  of  x,  in  T  with  no  other  predecessor)  qput  (y  ) 
remove  Xj  from  graph  T 
remove  *  from  set  P 
add  i  to  set  P 

}  until  (qempty  OmTRUE  and  P  *d) 

Algorithm  4.1  is  used  within  each  window,  and  the  total  estimated  parallel  execution 
time  is  the  sum  of  the  times  in  all  the  windows. 

4^4.2  Limitations 

The  estimates  produced  by  PARASITE  are  considerably  more  accurate  than  the 
presimulation  estimates,  but  certain  limitations  should  be  noted.  The  fact  that 
PARASITE  does  not  account  for  parallel  processing  overhead  has  been  previously  dis¬ 
cussed.  and  this  feature  can  be  viewed  as  either  an  advantage  or  disadvantage,  depend¬ 
ing  on  the  intended  application.  A  speedup  computed  from  a  PARASITE  time  estimate 
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if  «a  ipproiiotu  upper  bound  on  speedup,  subject  to  certain  conditions.  It  is  an  upper 
bound  because  multiprocessing  overhead  is  neglected,  thereby  resulting  in  an  overesti¬ 
mate  of  speedup.  It  is  approximate  because  PARASITE  does  not  compute  the  optimum 
schedule  of  the  tasks  on  the  parallel  processors.  The  estimate  is  subject  to  certain  con¬ 
ditions  arising  from  the  assumption  that  the  multiprocessing  run  executes  exactly  the 
same  tasks  as  the  uniprocessor  case. 

PARASITE  queues  tasks  as  soon  as  their  precedence  constraints  are  satisfied,  and 
obtains  a  task  from  the  queue  as  soon  as  a  processor  becomes  available  to  execute  a  new 
task.  This  results  in  an  optimum  scheduling  of  tasks  for  those  cases  where  the  number 
of  processors  is  either  1  or  infinite.  For  intermediate  numbers  of  processors,  the  task 
scheduling  may  be  nonoptimal.  When  more  than  one  task  is  on  the  queue,  then  the 
choice  of  which  task  to  execute  next  can  affect  the  overall  parallel  execution  time.  Even 
if  a  multiprocessor  waveform  relaxation  program  uses  the  same  policy  of  obtaining 
work  from  the  queue  on  a  first-in-first-out  basis,  small  delays  introduced  by  overhead 
in  the  multiprocessor  run  can  result  in  a  different  order  of  execution  and  a  different 
assignment  of  tasks  to  processors.  Normally  the  overhead-induced  delays  will  result  in 
a  longer  run  time  than  the  PARASITE  estimate,  but  it  is  possible  for  the  delays  to  cause 
a  reduction  in  the  run  time,  if  by  chance  a  more  efficient  scheduling  of  tasks  on  proces¬ 
sors  results.  PARASITE  could  be  modified  to  determine  the  optimum  schedule  of  tasks, 
but  this  optimization  problem  is  NP-complete  and  would  be  too  time  consuming 
[Gar 79}.  In  practice,  the  disruption  in  execution  order  caused  by  overhead  delays  is 
unlikely  to  result  in  a  significantly  different  schedule,  provided  the  overhead  is  small 
compared  to  the  task  execution  times. 

The  fact  that  PARASITE  assumes  that  the  same  tasks  are  executed  in  the  unipro¬ 
cessor  and  multiprocessor  cases  has  several  implications.  The  study  of  asynchronous 
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relaxation  met  bods  [Cha69.  Bau78].  in  which  the  task  graph  evolves  during  the  execu¬ 
tion  of  the  relaxation  process,  is  not  possible  with  PARASITE,  because  the  computations 
performed  can  depend  on  the  number  of  processors.  The  parallel  versions  of  the  strict 
Gavim-Seiriel  and  Gaum- Jacobi  relaxation  methods  do  not  present  a  problem  for 
PARASITE  because  the  precedence  constraints  between  tasks  are  designed  to  assure  that 
the  computations  in  the  parallel  case  operate  on  exactly  the  same  data  as  in  the  unipro¬ 
cessor  cam. 

The  PARASITE  results  apply  only  to  the  particular  circuit,  subcircuit  partitioning, 
subcircuit  ordering,  window  boundaries,  and  relaxation  method  used  in  the  reference 
uniprocessor  simulation.  Each  of  them  factors  can  affect  the  uniprocessor  run  time  and 
the  degree  of  parallelism. 

4A3  Results 

PARASITE  was  applied  to  the  benchmark  circuits,  producing  the  unnormalized 
speedup  estimates  in  Table  4.7.  The  PARASITE  results  for  the  full  window,  unlimited 
processor  case  agree  closely  with  some  of  the  Type  3  estimates  for  the  4-iteration  cam. 
and  are  as  much  as  2  times  smeller  in  some  of  the  other  esses.  The  discrepancies  are 
greatest  for  the  sedac  and  digji  circuits  which  are  comparatively  large,  evenly  parti¬ 
tioned  circuits.  Them  are  the  circuits  that  are  most  likely  to  have  a  large  number  of 
subcircuits  with  comparatively  slowly  changing  signals  in  any  given  window.  The 
variable  time  step  integration  algorithm  will  choose  very  long  time  steps  and  will  con¬ 
sequently  compute  very  few  time  points  in  these  subcircuits,  whereas  the  subcircuits 
with  rapidly  rhsngmg  signals  will  require  many  time  points.  Consequently,  the  subcir¬ 
cuits  with  low  signal  activity  cause  a  reduction  in  the  available  parallelism  as  compared 
with  the  Type  3  estimates.  It  should  be  noted  that  even  though  subcircuits  with  low 


Algorithm  Circuit 


FWT-GS 


TPP-GS 


FWT-GJ 


TPP-GJ 


FWT-GJ 


TPP-GJ 


1  2 


1.0  1.4 


1.0  2.0 

1.0  2.0 


Processors 
4  8  16 


1J  1J  1.5 

2.1  2.1  2.1 

3.7  5.3  5.6 

2.1  2.3  2.4 

3.3  3.8  3.9 


2.3  2.4  2.4 

2.8  3.0  3.0 

3.9  7.0  9.4 

2.8  3.1  3.2 

3.9  5.8  6.1 


3.5  5.3  6.2 

3.8  6.6  8.1 

3.9  7.2  11.0 

3.8  7.2  9.9 

3.9  7.6  13.7 


1 

3.9  7.1  10.6 
3.9  7.5  13.0 
3.9  7.4  11.7 
4.0  7.8  14.5 


3.1  4.3  5.0 

3.7  5.7  6.3 

3.8  6.4  .8.0 

3.9  7.0  9.2 

3.9  7.4  12.0 


1.0 

2.0 

3.8 

6.7 

1.0 

2.0 

3.9 

6.7 

1.0 

2.0 

3.9 

7.4 

1.0 

2.0 

3.9 

7.4 

1.0 

2.0 

4.0 

7.7 

8.6 

8.8 

8.8 

9.2 

9.9 

9.9 

11.1 

13.2 

13.8 

11.5 

13.1 

13.6 

14.2 

21.2 

25.6 

signal  activity  reduce  parallelism,  they  do  so  only  because  the  uniprocessor  waveform 
relaxation  algorithm  already  takes  advantage  of  this  situation  by  not  computing 
unnecessary  time  points  in  these  subcircuits. 

The  PARASITE  results  are  tabulated  as  a  function  of  the  number  of  processors. 
For  the  Gauss- Seidel  method  with  the  full  window  technique,  the  speedups  reach  their 


maximum  values  very  quickly  as  the  number  of  processors  are  increased  due  to  the 
severely  limited  parallelism.  The  Gauss-Jacobi  method  with  the  full  window  technique 
exhibits  speedups  close  to  the  number  of  processors  until  about  8  processors,  depending 
on  the  circuit. 

The  time  point  pipelining  speedup  estimates  are  necessarily  greater  than  the 
corresponding  full  window  speedups.  since  time  point  pipelining  exposes  greater  paral¬ 
lelism.  and  overhead  is  neglected.  The  degree  by  which  the  time  point  pipelining  speed- 
ups  are  greater  than  the  full  window  technique  speedups  depends  directly  on  the  the 
window  sizes.  The  windows  are  chosen  by  the  automatic  windowing  algorithm  of 
RELAX2.3.  which  tends  to  favor  small  windows  in  order  to  keep  the  number  of  itera¬ 
tions  small.  Larger  windows  would  result  in  greater  time  point  pipelining  speedups 
compared  to  the  full  window  technique  using  the  same  enlarged  windows;  but  very 
large  windows  would  cause  the  1-processor  reference  time  to  be  increased  due  to  an 
increase  in  the  number  of  iterations. 

Since  the  uniprocessor  run  times  are  known  in  computing  post-simulation  esti¬ 
mates.  it  is  possible  to  normalize  Gauss-Jacobi  speedups  to  Gauss-Seidel.  as  shown  in 
Table  4.8.  The  speedups  less  than  1  for  the  single  processor  case  reflect  the  normaliza¬ 
tion  factor.  Even  though  Gauss-Jacobi  starts  out  with  this  speed  disadvantage  on  1  pro¬ 
cessor.  the  normalized  Gauss-Jacobi  estimates  surpass  the  corresponding  Gauss-Seidel 
estimates  when  the  number  of  processors  is  sufficiently  large,  for  both  the  full  window 
technique  and  time  point  pipelining.  Consequently,  the  Gauss-Jacobi  method  with  time 
point  pipelining  offers  the  greatest  potential  speed  of  the  four  algorithms  when  the 
number  of  processors  is  large. 

The  break-even  point  between  the  Gauss-Seidel  and  Gauss-Jacobi  methods  is  a 
function  of  the  circuit  being  simulated.  Table  4.9  shows  which  of  the  relaxation 


Table  4.8.  PARASITE  Speedup  Estimates  Normalized  to  Gauss-Seidel 
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4.5 
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3.0 
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digit 

0.7 
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9.6 

9.6 
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0.7 

1.4 
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5.3 

9.2 

12.5 

13.6 

ben2k 
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1.3 

2.3 

4.7 

7.4 

8.5 

8.9 
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0.7 

1.4 
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5.4 

10.1 

16.9 
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6.3 
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7.3 
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0.7 

1.4 

2.8 

5.2 

7.8 

9.3 

9.8 

ben2k 

0.6 

1.3 

2.5 

4.7 

7.3 

8.3 

8.6 
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0.7 

1.4 

2.8 

5.4 

9.9 
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Table  4.9.  Fastest  Method  Based  on  PARASITE  Using 
Unaugmented  Task  Graphs  and  Time  Point  Pipelining 


Circuit 
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1  2  4  8  16  32  ee 
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S  S  J  J  J  J  J 
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S  S  S  S  S  J  J 

S  S  S  J  J  J  J 

S  S  S  S  J  J  J 

Key:  S-Gauss-Seidel.  J-Gauss- Jacobi 

methods  is  faster  for  each  circuit,  using  time  point  pipelining,  as  a  function  of  the 
number  of  processors,  based  on  the  PARASITE  estimates.  Typically,  larger  circuits 
have  a  higher  break-even  point  because  these  circuits  have  more  subcircuits  and  greater 
Gauss-Seidel  parallelism.  But  as  demonstrated  by  the  scdac  circuit,  circuit  size  alone  is 


70 


I 


not  a  sufic icnl  indicator  of  the  break-even  point.  The  Gauss- Jacobi  method  becomes 
preferable  only  when  the  Gauss-Seidel  method  does  not  produce  enough  concurrent 
work  for  the  available  supply  of  processors. 

Table  4.10  summarizes  the  impact  of  the  extra  constraints  of  the  augmented  task 
graph  on  the  Gauss-Jacobi  speedups  predicted  by  PARASITE.  Since  all  the  benchmark 
circuits  are  bidirectionally  coupled,  the  loss  in  speedup  resulting  from  the  use  of  the 
augmented  constraints  in  the  full  window  case  on  unlimited  processors  is  at  most  50%. 
as  proved  in  Chapter  2.  The  results  in  the  table  are  in  agreement  with  this  theoretical 
result.  If  only  a  limited  number  of  processors  are  available,  then  the  effect  of  the  aug¬ 
mented  graph  may  be  greatly  diminished,  depending  on  the  size  and  structure  of  the  cir¬ 
cuit.  In  the  full  window  case,  the  larger  circuits  ben2k  and  digji  show  less  than  a  5% 
degradation  in  speedup  on  8  processors,  because  there  is  sufficient  work  to  keep  the  pro¬ 
cessors  fairly  busy  even  when  the  extra  constraints  are  added.  The  smallest  circuit  dvs 
has  very  little  parallelism,  and  the  full  impact  of  the  extra  constraints  is  evident  on  8 
processors  in  the  full  window  case.  The  time  point  pipelining  speedups  are  affected  to  a 

Table  4.10.  PARASITE  Estimated  Speedup  Loss:  fG}  vs.  Ta} 
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lesser  degree  by  the  extra  constraints.  This  is  not  surprising,  due  to  the  parallelism 
which  the  pipelining  affords  between  adjacent  tasks  in  the  task  graph.  On  8  processors 
the  decrease  in  speedup  is  no  greater  than  5%  for  even  the  small  circuits,  due  to  the 
availability  of  many  concurrently  executable  tasks,  even  with  the  added  constraints. 


CHAPTER  5 


FULL  WINDOW  TECHNIQUE  IMPLEMENTATION 


The  choice  of  Gauss-Jacobi  or  Gauss-Seidel  relaxation  and  the  choice  of  the  full 
window  technique  or  time  point  pipelining  both  represent  tradeoffs  involving  parallel¬ 
ism  and  parallel  processing  overhead.  Parallel  processing  overhead  is  intimately  related 
to  the  multiprocessor  architecture,  the  parallel  algorithm,  and  the  details  of  the  imple¬ 
mentation  of  the  algorithm  on  the  multiprocessor.  To  study  the  performance  of  the 
different  algorithms  in  an  actual  parallel  processing  environment,  the  algorithms  were 
implemented  in  programs  which  run  on  an  Alliant  FX/8  multiprocessor.  The  FWT  pro¬ 
gram.  which  is  described  in  this  chapter,  is  an  implementation  of  the  full  window  tech¬ 
nique.  The  TPP  program  embodies  the  time  point  pipelining  algorithm  and  is  described 
in  Chapter  7.  Since  TPP  was  derived  from  FWT.  much  of  the  discussion  in  this  chapter 
applies  also  to  TPP. 

The  Alliant  FX/8  hardware  and  software  environment  is  briefly  described  in  the 
following  section.  Next,  the  locking  mechanism  for  protecting  the  integrity  of  shared 
data,  and  the  task  system  which  controls  the  parallel  execution  of  tasks  are  described: 
these  features  are  common  to  both  FWT  and  TPP.  The  implementation  of  parallel  pro¬ 
cessing  in  FWT  is  then  described.  Finally,  performance  measurements  are  given  and 
compared  with  the  estimates  produced  by  PARASITE. 

5.1  The  Multiprocessor 

The  Alliant  FX/8  is  an  8-processor  mini-supercomputer  [A1186a].  Each  processor 
is  capable  of  executing  scalar  and  vector  instructions,  with  a  peck  rate  of  5.9  Mflops 


when  operating  on  64-bit  floating  point  operands.  Each  processor  contains  vector  and 
scalar  registers  and  an  instruction  cache.  All  the  processors  share  a  common  main 
memory  system  which  is  accessed  through  a  shared  cache,  as  shown  in  Fig.  5.1. 
Although  the  Alliant  machine  is  limited  to  8  processors,  the  Cedar  multiprocessor 
supercomputer,  currently  under  development  at  the  Center  for  Supercomputing 
Research  and  Development  at  the  University  of  Illinois,  will  consist  of  multiple  Alli- 
ants  interconnected  through  a  global  shared  memory  [Kuc86].  Consequently,  the 
results  presented  in  this  chapter  for  actual  multiprocessor  runs  are  for  8  or  fewer  pro¬ 
cessors.  and  the  speedup  estimates  of  previous  chapters  will  be  used  for  predicting  the 
perf ormance  on  f uture  machines  with  more  processors. 

The  Alliant  computer  runs  a  UNIX  operating  system  and  has  a  FORTRAN  com¬ 
piler  [A11S5]  which  automatically  vectorizes  and  parallelizes  DO  loops,  based  on  its 


Figure  5.1.  Alliant  FX/8  architecture. 


analysis  of  data  dependencies.  A  compiler  for  the  C  language  [A1186b]  is  also  available, 
but  it  does  not  perform  automatic  vectorization  and  parallelization.  The  FWT  and  TPP 
programs  are  derivatives  of  RELAX2.3,  and  all  3  of  the  programs  are  written  in  C.  As 
a  result,  these  programs  are  not  automatically  vectorized  and  parallelized  by  the  com¬ 
piler.  However,  even  if  the  automatic  parallelizing  features  of  the  FORTRAN  compiler 
were  available  in  the  C  compiler,  the  natural  parallelism  of  waveform  relaxation  would 
not  be  recognized  by  the  compiler. 

Suppose  that  the  C  compiler  included  the  automatic  parallelization  features,  and 
that  the  main  program  of  the  uniprocessor  waveform  relaxation  program  is  a  realization 
of  Algorithm  2.1.  in  which  the  subcircuit  evaluation  task  is  realized  with  a  subroutine 
call.  When  the  subroutine  is  called  it  is  passed  a  pointer  to  the  data  structure  for  the 
subcircuit,  and  this  structure  contains  pointers  to  other  subcircuits  which  supply  its 
input  waveforms.  When  the  main  program  is  compiled,  the  compiler  is  not  aware  of 
how  all  these  pointers  will  be  used  when  the  subroutine  is  executed,  and  the  pointers 
are  not  even  known  at  compile  time  because  they  are  circuit  dependent.  Therefore,  the 
compiler  would  decide  to  execute  the  subcircuit  evaluation  tasks  serially,  because  the 
potential  exists  for  arbitrary  data  interdependencies  between  the  tasks.  The  interdepen¬ 
dencies  could  result  in  improper  operation  if  the  tasks  are  allowed  to  execute  con¬ 
currently. 

A  FORTRAN  compiler  directive  is  provided  which  allows  the  different  iterations 
of  a  loop  to  be  performed  concurrently  even  if  the  loop  contains  a  subroutine  call.  This 
could  be  used  directly  for  the  subcircuit  loop  in  Algorithm  2.1  when  Gauss-Jacobi  is 
used.  However,  for  Gauss-Seidel.  the  partial  ordering  of  subcircuits  within  an  iteration, 
as  dictated  by  the  task  graph,  must  be  enforced  These  constraints  are  not  readily 
recognized  by  the  compiler  because  they  arise  from  the  interactions  of  executable  state- 


ments  and  data  structures  which  are  built  dynamically  during  program  execution. 
Consequently,  the  programmer  must  assume  responsibility  for  setting  up  a  mechanism 
to  control  the  parallel  execution  of  waveform  relaxation  tasks. 

The  concurrent  use  of  multiple  processors  is  implemented  in  the  C  environment 
with  the  concurrent  call  feature.  In  a  concurrent  call,  a  C  function  is  called  a  specified 
number  of  times.  M .  Each  of  the  M  invocations  is  assigned  a  unique  index  number  and 
runs  on  a  separate  processor,  provided  M  ^8.  Since  the  function  can  test  its  index 
number  and  perform  different  actions  based  on  its  value,  the  processors  may  execute 
any  independent  instruction  streams.  When  all  of  the  concurrent  function  invocations 
terminate,  control  returns  to  the  calling  program  which  then  continues  running  on  a 
single  processor. 

Vector  instructions  are  accessible  from  C  programs  through  calls  to  library  func¬ 
tions.  However,  vector  instructions  are  not  used  in  FWT  and  TPP.  because  most  subcir¬ 
cuit  are  small  in  size  and  yield  small  vector  lengths.  In  previous  work,  vector  process¬ 
ing  techniques  have  been  used  with  the  standard  direct  method  algorithms  applied  to 
the  entire  circuit,  in  which  case  the  potential  for  longer  vector  lengths  exists.  Even  in 
this  more  favorable  situation,  the  vector  parallelism  is  limited  due  to  the  high  degree  of 
sparsity  in  circuit  matrices,  the  irregular  structure  of  the  matrix,  and  the  different 
equations  which  must  be  evaluated  in  different  operating  regions  of  the  transistors. 

The  main  memory  of  the  Alliant  FX/8  is  shared  by  all  the  processors.  One  proces¬ 
sor  may  communicate  with  another  simply  by  writing  to  a  memory  location  which  is 
later  read  by  the  other  processor.  It  is  the  programmer  s  responsibility  to  assure  that 
one  processor  does  not  corrupt  the  data  needed  by  another  processor,  and  that  one  pro¬ 
cessor  does  not  try  to  read  data  before  the  data  are  written  by  another  processor.  Data 
accesses  by  different  processors  to  common  memory  locations  can  be  synchronized  using 
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either  global  or  local  synchronization  techniques. 

The  global  synchronization  technique  is  easiest  to  implement.  A  set  of  indepen¬ 
dent  tasks  are  allowed  to  execute  concurrently.  Tasks  which  require  input  data  from 
other  tasks  in  the  set  are  excluded  from  the  set.  When  all  the  tasks  are  finished,  a  new 
set  of  independent  tasks  are  started.  Each  set  can  consist  of  the  tasks  in  one  level  of  a 
task  graph.  However,  performing  global  synchronizations  between  each  level  of  the 
task  graph  does  not  exploit  the  full  parallelism  of  the  graph,  as  observed  in  the  results 
of  the  Type  1  and  Type  2  estimates  of  Chapter  4. 

The  local  synchronization  approach  allows  data  to  be  exchanged  between  two  pro¬ 
cessors  without  waiting  for  the  other  processors  to  reach  a  global  synchronization  point. 
Local  synchronizations  allow  for  the  globally  unsynchronized  execution  of  a  task  graph 
as  assumed  in  the  Type  3  and  PARASITE  estimates  of  Chapter  4.  Local  synchroniza¬ 
tions  are  accomplished  on  the  Alliant  using  locks,  which  are  based  on  the  atomic  lest- 
and-set  instruction. 

S.2  Locks 

Locks  are  typically  used  to  protect  shared  data  during  critical  portions  of  time 
during  which  the  data  may  only  be  accessed  by  one  processor.  For  example,  in  Algo¬ 
rithm  2.2  the  global  variable  unconvk  is  accessed  3  limes.  Its  value  is  read  from 
memory,  and  after  the  value  is  decremented  it  is  written  back  to  memory.  Then  it  is 
read  again  to  test  if  its  value  is  0.  In  order  to  guarantee  the  proper  operation  of  this 
algorithm,  no  other  processor  executing  the  algorithm  concurrently  may  access  unconvi 
during  the  time  interval  spanned  by  these  3  accesses.  This  rule  can  be  enforced  by  asso¬ 
ciating  a  lock  with  the  variable  unconvk  .  and  by  locking  the  lock  before  the  first  access 
and  unlocking  the  lock  after  the  last  of  the  3  accesses.  If  a  processor  tries  to  lock  a  lock 
which  is  already  locked,  it  will  be  forced  to  wail  until  the  lock  is  released  by  the  other 


processor  before  it  is  allowed  to  acquire  the  lock  and  continue. 

Locks  are  implemented  with  the  atomic  test-and-set  machine  instruction.  This 
instruction  is  not  directly  accessible  in  a  C  language  statement,  but  may  be  accessed 
through  a  function  call  or  through  embedded  assembly  language  instructions.  The  FWT 
and  TPP  programs  set  locks  using  a  LOCK  macro,  which  is  translated  into  either  a  func¬ 
tion  call  or  assembly  language  instructions  prior  to  compilation.  In  either  case,  the 
LOCK  macro  causes  the  execution  of  the  following  algorithm,  in  which  x  is  a  lock  vari¬ 
able. 

Algorithm  5.1.  LOCK(x) 

repeat  { 

test-and-set(x  ) 
delay 

}  until  (the  test-and-set  operation  is  successful) 

The  test-and-set  operation  is  successful  only  if  x  =0.  in  which  case  it  sets  x  *-l.  The 
testing  of  the  value  of  x  and  the  setting  of  its  value  to  1  are  performed  as  an  atomic 
operation  to  prevent  two  different  processors  from  successfully  setting  the  same  lock 
simultaneously.  The  delay  is  added  for  performance  reasons  to  be  discussed  below. 
Locks  are  unlocked  in  FWT  and  TPP  by  an  UNLOCK  macro.  By  convention,  only  the 
processor  which  locked  a  lock  is  allowed  to  unlock  it.  The  following  rather  brief  algo¬ 
rithm  accomplishes  the  unlocking  chore  and  is  trivially  implemented  directly  in  C  code. 

Algorithm  5.2.  UNLOCK(x) 
x  -0 

When  one  processor  tries  to  lock  a  lock  which  is  already  locked,  it  enters  a  tight 
loop  in  which  it  repeatedly  tests  the  value  of  the  lock  until  tne  lock  is  released  by  the 
other  processor.  This  has  two  effects  on  performance.  The  most  obvious  and  important 
effect  is  that  the  waiting  processor  does  not  do  any  useful  work  while  it  is  waiting  for 
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the  lock.  A  second  effect  is  that  the  waiting  processor  generates  cache  traffic  while 
repeatedly  testing  the  lock,  and  thereby  slows  down  the  gainfully  employed  processors. 

The  cache  competition  caused  by  processors  waiting  on  locks  can  be  reduced  by 
adding  null  operations  in  the  loop  to  effect  a  delay  between  successive  accesses  of  the 
lock.  The  null  operations  are  presumably  executed  out  of  the  local  processor  instruc¬ 
tion  cache  without  competing  for  access  to  the  shared  cache.  As  the  delay  in  the  loop  is 
increased,  the  cache  traffic  is  further  reduced.  However,  increasing  the  delay  also 
increases  the  average  lime  that  a  waiting  task  will  be  delayed  after  a  lock  is  released  by 
another  processor  before  realizing  that  the  lock  is  available. 

A  locking  operation  which  uses  a  function  call  is  referred  to  as  a  normal  lock,  and 
the  case  of  embedded  assembly  language  is  referred  to  as  a  fast  lock.  A  fast  lock  which 
is  successful  on  its  first  attempt  executes  only  two  machine  instructions,  the  tesl-and- 
set  and  a  conditional  branch.  A  normal  lock  includes  the  overhead  of  the  function  call 
mechanism,  which  is  significant  compared  to  the  time  required  for  only  two  machine 
instructions.  If  the  lock  is  not  acquired  successfully  on  the  first  attempt,  the  perfor¬ 
mance  difference  between  fast  and  normal  locks  becomes  less  significant. 

Normal  locks  are  portable  and  robust  with  respect  to  revisions  in  the  compiler. 
Fast  locks  do  not  share  these  advantages  because  the  interface  between  the  C  code  and 
assembly  code  is  through  a  register  which  is  assigned  automatically  by  the  compiler. 
When  using  many  locking  operations  to  control  fine-grained  accesses,  the  function  call 
overhead  can  be  significant,  especially  if  the  lock  is  acquired  on  the  first  try  in  most 
cases,  since  then  only  two  machine  instructions  are  executed.  But.  if  the  number  of 
locking  operations  is  a  small  percentage  of  the  total  number  of  operations,  then  the  per¬ 
formance  advantage  of  fast  locks  is  insignificant.  Due  to  the  relatively  large  task 
granularity  employed  in  FWT  and  TPP,  the  use  of  fast  locks  does  not  result  in  a 
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significant  performance  improvement.  Experiments  with  finer-grained  tasks  have 
demonstrated  that  fast  locks  can  make  a  noticeable  impact  on  performance  in  such 
cases. 

A  more  fundamental  observation  concerning  locks  on  the  Alliant  compared  to 
some  other  machines  is  that  even  normal  locks  are  quite  fast.  Therefore,  when  the  level 
of  concurrency  can  be  increased  by  using  more  locks,  each  controlling  a  smaller  group 
of  shared  variables,  the  resulting  increase  in  the  time  spent  performing  locking  opera¬ 
tions  will  normally  not  be  severe.  Therefore,  the  general  guideline  applied  to  the  use  of 
locks  in  the  FWT  and  TPP  programs  has  been  to  keep  the  amount  of  code  in  locked  sec¬ 
tions  small,  even  if  this  requires  extra  locking  operations  and  extra  lock  variables. 

5.3  Task  System 

A  central  fixture  of  the  FWT  and  TPP  programs  is  the  task  system  which  queues 
tasks,  assigns  queued  tasks  to  processors,  and  terminates  the  parallel  processing  mode 
when  all  tasks  are  done.  A  central  queue  is  used  to  hold  tasks  which  are  ready  to  exe¬ 
cute.  In  some  parallel  processing  architectures,  some  memory  locations  can  be  accessed 
more  quickly  from  a  particular  processor  or  group  of  processors.  In  such  systems,  the 
assignment  of  tasks  to  processors  may  take  the  proximity  of  the  required  data  into 
account.  However,  on  the  Alliant.  all  data  in  the  cache  and  memory  are  equally  accessi¬ 
ble  from  any  processor.  Therefore,  data  proximity  considerations  are  not  appropriate  in 
the  task  scheduler  for  the  FWT  and  TPP  programs  on  the  Alliant.  When  a  processor 
becomes  available  to  begin  working  on  a  new  task,  the  next  appropriate  task  is  taken 
from  the  queue  and  assigned  to  the  processor. 

Two  complications  are  introduced  by  allowing  one  waveform  relaxation  iteration 
to  begin  before  the  preceding  iteration  is  completed.  One  problem  is  that  a  task  in  a 
later  iteration  may  become  eligible  for  execution  before  it  is  determined  that 
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convergence  occurs  on  a  preceding  iteration.  Consequently,  unnecessary  tasks  may  be 
executed,  and  these  tasks  will  use  resources  which  would  be  better  used  for  required 
tasks  associated  with  earlier  iterations.  This  problem  is  addressed  by  assigning  each 
task  a  priority  based  on  its  iteration  number.  Tasks  associated  with  lower  iteration 
numbers  are  assigned  higher  priorities.  Separate  subqueues  are  maintained  for  each 
priority  level.  When  the  task  system  obtains  a  task  from  the  queue  to  be  executed,  it 
selects  a  task  from  the  nonempty  subqueue  with  the  highest  priority. 

The  other  problem  arising  from  overlapping  iterations  is  that  convergence  might 
be  obtained  while  some  tasks  are  in  the  middle  of  execution.  For  this  reason,  and  also 
to  accommodate  some  recoverable  error  conditions  that  can  arise  during  the  simulation 
such  as  waveform  buffer  overflow,  a  facility  is  provided  to  gracefully  kill  all  tasks 
which  are  executing  or  queued  and  to  prevent  new  tasks  from  being  queued.  Executing 
tasks  periodically  check  a  flag,  and  if  it  is  set  they  terminate  with  their  associated  data 
structures  in  a  state  which  is  acceptable  for  continuing  the  simulation  after  the  global 
synchronization. 

The  operation  of  the  task  system  is  summarized  in  the  following  algorithm  out¬ 
lines.  The  global  variable  tqjcount  is  the  number  of  tasks  which  are  queued  or  are  exe¬ 
cuting.  The  lock  queue  Jock  protects  the  queue  and  tqjcount .  A  task  tsk  of  priority  p 
is  queued  by  the  following  algorithm: 

Algorithm  53.  Queue_Task  (tsk,  p) 

if  (kill  J lag  -FALSE )  { 

LOCK  ( queue  Jock  ) 
append  tsk  to  subqueue  p 
tqjcount  —tqjcount  +1 
L’NLOCKC^ueue  lock  ) 

> 


The  execution  of  tasks  on  all  parallel  processors  is  under  the  control  of  the  parallel  task 


controller,  which  is  executed  on  each  processor  using  the  concurrent  call  mechanism. 


Algorithm  3A.  Parallel_Taak_Controller — runs  on  each  processor 

tsk  —NULL 
while  (tqjcount  >0)  { 

LOCUqueueJock  ) 

if  (there  is  a  task  on  the  queue)  { 

tsk  —  (task  from  lowest  numbered  subqueue) 
remove  tsk  from  subqueue 

) 

UNLOCK($wueJoc*  ) 
if  (tsk  nNULL  )  { 
execute  tsk 
tsk  - NULL 

LOCKi  queue JLock  ) 
tqjcount  — tqjcount  —1 
UNLOCK(?ueiie  lock  ) 

) 

else  delay 


Note  that  if  the  queue  is  empty,  then  the  if  clauses  are  bypassed  and  the  operations  in 
the  loop  consist  of  repeated  accesses  to  a  few  global  variables:  queue  Jock  .  tp_count .  and 
the  location  in  the  queue  data  structure  which  indicates  that  the  queue  is  empty.  If 
several  processors  are  without  work  and  the  queue  is  empty,  this  results  in  excessive 
cache  traffic  for  these  variables,  which  interferes  with  the  operation  of  the  working  pro¬ 
cessors.  The  delay  in  the  else  clause  is  introduced  to  relieve  the  cache  congestion  in  this 
case,  in  the  same  manner  in  which  the  delay  was  introduced  in  the  LOCK  routine.  The 
value  of  the  delay  is  optimized  empirically. 

The  task  killer  is  invoked  by  any  task  which  determines  that  all  other  currently 
queued  and  executing  tasks  are  unnecessary. 


Algorithm  S.S.  Kill.Tasks 

LOCKiqueueJock  ) 
kilt J lag  -TRUE 

remove  all  tasks  from  all  subqueues 


decrement  tqjxunt  by  the  number  of  tasks  removed 
USLOCKiqueueJock  ) 

wait  until  tqjcouni  *1.  (i.e.,  until  all  other  tasks  terminate) 


5.4  RELmod 

The  starting  point  for  the  development  of  the  FWT  program  was  the  RELAX2.3 
program  developed  by  Jacob  White  [Whi86],  RELAX2.3  is  a  uniprocessor  waveform 
relaxation  program.  It  was  modified  to  produce  another  uniprocessor  program  known 
as  RELmod.  which  contains  additional  research-oriented  features.  RELmod  served  as 
the  basis  for  the  FWT  multiprocessor  waveform  relaxation  program  which  uses  the  full 
window  approach  to  parallelism.  Finally,  the  FWT  program  was  modified  to  create  the 
TPP  program  which  uses  time  point  pipelining. 

RELmod  uses  the  partitioning,  ordering,  windowing,  and  numerical  integration 
algorithms  of  RELAX2.3  without  any  substantive  changes.  The  modifications  incor¬ 
porated  into  RELmod  include  corrections  of  bugs  and  features  intended  primarily  for 
research  use.  Specifically.  RELmod  contains  these  features: 

(a)  The  Gauss-Jacobi  method,  and  hybrid  Gauss-Jacobi/Gauss-Seidel  methods  are 
implemented. 

(b)  Several  bugs  are  fixed,  the  most  notable  being  a  bug  in  the  resetting  of  the  error 
tolerances  when  suiting  a  new  window.  This  bug  had  a  significant  negative 
impact  on  the  Gauss-Jacobi  run  times  in  some  cases,  where  slow  convergence  led  to 
repeated  reductions  in  window  size  and  repeated  reductions  in  error  tolerance 
which  were  never  retracted. 

(c)  Overlapped  partitioning  of  subcircuits  is  permitted  as  an  option  [Mok85]. 

(d)  The  subcircuit  partitioning  and  ordering  may  optionally  be  specified  manually, 
rather  than  being  generated  automatically. 


(e)  Window  boundaries  may  optionally  be  specified  manually,  rather  than  being 
determined  automatically. 

(f)  Optional  output  files  may  be  requested  containing  the  subcircuit  partitioning  and 
ordering,  and  the  initial  conditions  used.  These  files  are  in  a  format  acceptable  for 
input  to  the  program  on  a  later  run.  This  allows  many  transient  analysis  runs  to 
be  made  without  repeating  the  partitioning,  ordering,  and  initial  dc  solution  com¬ 
putations.  Also,  it  facilitates  experiments  in  which  minor  changes  are  made  in  the 
automatically  generated  partitioning. 

(g)  The  rules  for  specifying  periodic  piecewise  linear  voltage  sources  are  more  user 
friendly. 

Since  RELmod  uses  the  same  numerical  algorithms  as  FWT  without  any  of  the  extra 
parallel  processing  code,  it  serves  as  an  ideal  reference  to  which  the  performance  of 
FWT  can  be  compared.  Comparisons  of  the  run  times  of  RELmod  and  FWT  indicate 
that  the  extra  overhead  of  FWT  on  1  processor  increases  the  run  time  by  less  than  2%. 

5J5  FWT 

The  FWT  program  uses  RELmod  as  a  base,  and  implements  the  full  window  tech¬ 
nique  for  parallel  waveform  relaxation.  It  can  be  run  on  1  to  8  processors  of  an  Alliant 
FX/8.  Parallelism  is  exploited  only  in  the  transient  analysis  phase  of  the  program,  since 
this  is  the  most  time  consuming  and  the  area  of  greatest  potential  payoff. 

The  basic  idea  behind  the  FWT  program  implementation  is  quite  simple.  A  tem¬ 
plate  of  the  augmented  task  graph  T  is  constructed  for  the  specified  relaxation  method. 
In  each  window,  the  initial  tasks  are  placed  on  the  task  system  queue.  The  task  system 
assigns  tasks  from  the  queue  to  available  processors.  As  each  task  finishes  execution,  it 
checks  its  successor  tasks  and  queues  those  for  which  all  the  input  waveforms  have 


been  computed.  Also,  just  before  terminating,  a  task  performs  a  convergence  check  on 
its  own  waveforms,  and  checks  the  accumulated  convergence  status  of  all  other  tasks  in 
the  iteration  to  see  if  convergence  was  obtained  on  that  iteration.  When  convergence  is 
detected,  all  executing  and  queued  tasks  are  killed  and  a  new  window  is  started. 

Within  each  window,  the  iterations  are  partitioned  into  iteration  groups,  such  that 
each  group  contains  k  consecutive  iterations.  A  global  synchronization  is  performed 
between  each  iteration  group.  The  default  value  of  k  is  6.  There  are  several  reasons  for 
adding  these  global  synchronization  points.  The  primary  reason  is  that  it  simplifies  the 
implementation  of  the  periodic  reductions  of  the  window  size  and  error  tolerance  which 
occur  when  too  many  iterations  are  used.  The  second  reason  is  that  it  allows  for  fixed 
limits  to  be  placed  on  certain  arrays  which  require  a  separate  array  element  for  each 
iteration  of  the  group.  Finally,  it  limits  the  number  of  unnecessary  tasks,  with  itera¬ 
tion  numbers  greater  than  the  converging  iteration  number,  which  may  be  executed 
before  convergence  is  detected. 

The  added  synchronization  points  between  iteration  groups  can  result  in  reduced 
parallelism.  However,  in  most  windows  convergence  is  obtained  in  the  first  iteration 
group,  and  the  extra  synchronizations  do  not  occur.  Furthermore,  the  Type  3  speedup 
estimates  in  Tables  4.5  and  4.6  indicate  that  the  parallelism  is  only  a  weak  function  of 
the  number  of  iterations,  especially  when  the  number  of  iterations  is  greater  than  4. 
Therefore,  even  if  several  groups  of  6  iterations  are  required  in  a  window,  the  speedup 
in  each  group  will  be  about  the  same  as  could  be  obtained  without  the  synchronizations 
between  groups. 

The  algorithm  for  the  transient  analysis  phase  of  FWT,  using  iteration  groups,  is 
outlined  below.  The  currert  window  boundaries  are  represented  by  ta  and  tb ,  ksiarj 
and  kstof)  are  the  first  and  last  iterations  of  the  current  iteration  group.  The 


global jconvtrgtd  flag  becomes  TRUE  when  convergence  is  detected  in  all  subcircuits  on 
some  iteration.  For  each  subcircuit  there  are  k  counters,  one  associated  with  each  itera¬ 


tion  of  the  group.  Counter  unsat  t  ,  represents  the  number  of  unsatisfied  precedence 
constraints  for  subcircuit  evaluation  task  (k  .  i ).  £e.,  the  number  of  predecessors  in  the 
task  graph  which  have  not  finished  execution. 


Algorithm  5A  FWT  Transient  Analysis 


while  ( ta  <tf  )  |  /’window  loop  */ 

choose  tb 


global  jconverged  —FALSE 
repeat  |  /*  iteration  group  loop  */ 

initialize  unsatt  { .  for  knatt  £k  £kaop  .  1 
queue  initial  tasks  for  iteration  group 
Parallel^  TaskController  () 
if  {global jconvtrgtd  = FALSE  )  { 


^  it  art  *~^nof 
"■nap  "-stop  T  * 

reduce  integration  error  tolerance 
reduce  tb 

I 

}  until  ( global _converged  -TRUE  ) 
reinitialize  integration  error  tolerance 


The  program  runs  on  a  single  processor  except  when  the  parallel  task  controller  is  run¬ 
ning.  in  which  case  Nprocs  processors  are  used,  for  a  specified  value  of  Nprocs  between  1 
and  S. 


The  subcircuit  evaluation  tasks  are  executed  under  the  control  of  the  parallel  task 
controller.  The  initial  tasks  are  queued  prior  to  turning  control  over  to  the  parallel  task 
controller.  These  are  the  tasks  which  have  an  in-degree  of  zero  in  the  task  graph,  after 
all  tasks  of  previous  iteration  groups  are  removed.  Before  terminating,  a  task  updates 
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the  unsat  counters  of  its  successors  and  queues  successor  tasks  which  are  ready  to  exe¬ 
cute.  as  specified  in  the  following  algorithm. 

Algorithm  5.7.  Subcircuit  Evaluation  Task  U.i) 

/•  Solve  subcircuit  V 

Solve  subcircuit  i  on  iteration  k  over  time  interval  [r0  .  tb  ] 

/*  Check  successors  */ 

for  each  (successor  (k^  .  )  of  (*  .  i )  in  T  )  { 

LOCK  ( unsat  Jock ,  ) 

me 

unsat.  '  •-unsat  k  t  —1 

*  saK  1  she  me  '  me 

if  (unsat t  ,  *0)  queue  jask  (k^  .  ) 

UNLOCKOunmr  lock,  ) 

me 

) 

/*  Check  convergence  •/ 

if  (v/* )  matches  v/*  within  tolerance,  for  t  €[ffl .  tb  D  { 

LOCK(unconv~locki ) 
unconvt  —unconvk  —1 
if  (unconvk  =0)  conv  •-TRUE 
else  conv  —FALSE 
UNLOCK(unconv_tockt ) 
if  (conv  ssTRi.  2  )  { 

global, jconverged  —TRUE 

KiU  Tasks  0 

) 


5.6  Data  Structures 

The  FWT  program  uses  the  augmented  form  of  the  task  graph,  T ,  which  simplifies 
the  management  of  data  structures  and  reduces  the  required  memory  space.  Since  only 
one  instance  of  any  given  subcircuit  cars  be  active  at  a  lime,  most  data  structures  associ¬ 
ated  with  the  subcircuit  need  only  be  allocated  once,  and  may  be  reused  by  each  task 
which  is  an  instance  of  the  subcircuit.  The  most  important  data  structures  which  fall 
into  this  category  include 
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(a)  space  for  the  matrix  and  vectors  representing  the  linearized  system  of  equations  on 
each  Newton  iteration; 

(b)  the  list  of  devices  contained  in  the  subcircuit,  along  with  parameter  values  and 
pointers  into  the  matrix  which  indicate  where  contributions  from  the  model  equa¬ 
tions  should  be  loaded  into  the  matrix  on  each  Newton  iteration; 

(c)  relative  pointers  to  successor  tasks:  and 

(d)  the  time  values  and  vectors  of  voltages,  currents,  and  charges  at  the  last  few  time 
points  as  required  by  the  integration  algorithm. 

Some  data  structures,  however,  require  separate  copies  corresponding  to  different  itera¬ 
tions  of  a  subcircuit,  including 

(a)  the  unsat  counters,  and 

(b)  waveform  buffers  which  contain  the  time/voltage  pairs  for  the  time  points  com¬ 
puted  at  each  node  in  the  current  window. 

The  number  of  unsat  counters  is  Kn  .  and  these  counters  are  simply  allocated  once  and 
initialized  at  the  beginning  of  each  iteration  group.  The  waveform  buffers  represent  a 
larger  investment  in  storage  space. 

Even  though  only  one  instance  of  each  subcircuit  will  be  active  at  any  time,  it  is 
possible  in  general  that  all  k  instances  of  the  resulting  waveforms  of  a  given  subcircuit 
may  be  required  simultaneously  by  other  subcircuits.  Consider  the  example  in  Fig.  5.2 
of  a  circuit  which  is  not  bidirectionally  coupled.  If  the  instances  of  subcircuits  1  and  2 
require  a  small  computation  time  compared  to  the  instances  of  subcircuit  3.  it  is  possi¬ 
ble  that  all  k  instances  of  subcircuits  1  and  2  will  be  finished  before  the  first  instance  of 
subcircuit  3  finishes.  Thus,  while  task  (1.  3)  is  executing,  the  results  of  (1.  1)  and  (1. 
2)  are  needed  as  inputs  to  (1.  3),  and  the  results  of  all  other  instances  of  subcircuits  1 
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(1.1)  (2.1)  (3.1) 


(a) 


(b) 


Figure  5.2.  Waveform  buffer  example;  (a)  G  .  (b)  Tas.y 


and  2  must  be  saved  for  future  use.  To  accommodate  this  situation,  a  provision  must 
be  made  to  have  k  waveform  buffers  existing  simultaneously  for  the  nodes  of  subcir¬ 
cuits  1  and  2. 

Sometimes,  separate  waveform  buffers  are  not  needed  for  each  iteration.  The  next 
theorem  provides  a  limit  on  the  number  of  simultaneously  required  waveform  buffers 
in  the  case  of  bidirectionally  coupled  circuits. 

Theorem  5.1.  If  C  is  bidirectionally  coupled,  then  no  mare  than  2  waveform  buffers  are 
needed  for  each  node  at  any  one  time  in  the  FWT  program  based  on  either  ^cj.k  w  Tqs.k- 

Let  (k  ,  i  )  be  any  task  in  the  task  graph,  and  define  a  set  T  of  tasks  which  are  instances 
of  subcircuit  i  such  that 

T*  K/n.i):  l^m  ^*-2}.  (5.1) 

When  task  (.k  ,i )  begins  execution,  it  requires  waveform  buffers  to  store  the  results  for 

its  internal  nodes.  This  is  the  first  time  that  these  waveform  buffers  are  needed.  If  the 

contents  of  a  waveform  buffer  are  computed  by  task  x  ,  then  the  waveform  buffer  is  no 
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longer  needed  after  all  successors  of  x  have  terminated.  Since  (k  ,i)  begins  executing 
before  any  instance  of  i  with  a  greater  iteration  number,  it  is  sufficient  to  show  that 
when  (A  .  i )  starts,  all  successors  of  tasks  in  T  have  already  finished  execution.  If  this 
is  true,  then  every  time  a  new  waveform  buffer  is  needed  for  a  node,  only  one  other 
waveform  buffer  for  that  node  on  a  different  iteration  is  required  simultaneously— 
namely,  the  waveform  buffer  of  the  previous  iteration.  Equivalently,  it  is  sufficient  to 
show  that  there  is  a  path  to  (A  .  i )  from  each  successor  of  each  task  in  T.  Let  (A0.  »0)  be 
any  successor  of  any  task  in  I*.  Note  that  AU<A  — 1.  There  is  a  path  from  (A0.  i0)  to 
(A  .  i  )  of  the  form 

(A0.i0).(A0.‘  ).(A0+l.i  ).  •  •  •  .(A—  l.i  ).(A  ,i ) 
or 

(A0.i0).  (A0+l.i  ).  •  •  •  .(A— l.i  ),(A  ,i  ). 

All  arcs  in  the  path  except  the  first  are  arcs  which  are  added  when  the  augmented  graph 
graph  is  constructed  from  the  unaugmented  graph.  The  first  arc  must  exist  because 
(A0.i0)  is  a  successor  of  an  instance  of  i  and  the  circuit  is  bidirectionally  coupled. 
Hence,  no  more  than  2  waveform  buffers  are  needed  for  each  node.  Note  that 
waveform  buffers  for  iterations  A  and  A  —1  are  required  simultaneously  in  order  to 
perform  convergence  checking. 

To  handle  general  circuits,  the  program  must  be  able  to  provide  separate 
waveform  buffers  for  each  node  on  each  iteration,  all  existing  simultaneously.  This 
amounts  to  kN  waveform  buffers,  where  N  is  the  total  number  of  nodes  in  the  circuit. 
However,  for  bidirectionally  coupled  circuits,  only  2N  waveform  buffers  are  needed  at 
any  one  time.  In  order  to  accommodate  general  circuits  and  not  waste  excessive 
memory  space  for  bidirectionally  coupled  circuits,  the  FWT  program  dynamically  allo¬ 
cates  waveform  buffers  as  needed.  A  pool  of  available  waveform  buffers  is  maintained 
on  each  processor.  When  a  processor  runs  out  of  waveform  buffers,  it  obtains  an 


90 


additional  set  of  buffers  by  executing  the  dynamic  memory  allocation  program. 
Although  the  dynamic  memory  allocator  is  executed  in  a  critical  section  by  at  most  one 
processor  at  a  time,  contention  between  different  processors  is  minimized  because  most 
requests  for  waveform  buffers  are  satisfied  locally  from  the  processor's  own  pool  of 
waveform  buffers.  A  counter  is  associated  with  each  waveform  buffer,  indicating  how 
many  tasks,  which  have  not  yet  terminated,  require  the  use  of  the  waveform  buffer. 
The  program  is  designed  such  that  when  the  count  reaches  0.  the  waveform  buffer  is 
freed  by  returning  it  to  the  pool  of  available  buffers.  This  design  has  not  been  fully 
implemented  in  the  current  version  of  FWT.  Currently,  waveform  buffers  are  not 
freed  until  the  end  of  the  iteration  group.  Due  to  the  large  virtual  address  space,  this 
has  not  presented  a  problem  in  the  simulation  of  the  benchmark  circuits. 

5.7  Results 

The  performance  results  for  the  FWT  program  are  given  in  Tables  5.1  and  5.2. 
The  results  are  compared  with  the  PARASITE  results  in  Table  5.3.  The  FWT  speedups 
on  8  processors  are  within  11%  of  the  PARASITE  estimates.  The  differences  between 
the  measured  speedups  and  the  estimates  include  the  overhead  factors  and  the  estimate 
errors  noted  in  Chapter  4.  The  FWT  program  sometimes  generates  slightly  different 
window  boundaries  under  multiprocessing  conditions  than  in  the  uniprocessor  reference 
run.  which  introduces  an  additional  small  error  in  the  comparison.  As  a  point  of  refer¬ 
ence.  the  scdac  runs  using  Gauss-Seidel  used  exactly  the  same  window  boundaries  in  all 
the  multiprocessor  and  uniprocessor  runs.  The  good  agreement  between  the  estimates 
and  the  actual  results  indicate  that  overhead  plays  a  minor  role  in  determining  the  per¬ 
formance  of  FWT  on  these  benchmark  circuits. 


i  v  I.* i  J  pvY*  ,\t , V ,v»  v * 


Table  5.1.  FWT  Gauss-Seidel  Speedups 


Circuit 


1 

Processors 

2  4 

8 

1.0 

1.4 

1.4 

1.4 

1.0 

1.6 

2.0 

2.0 

1.0 

2.0 

3.6 

4.9 

1.0 

1.6 

1.9 

2.1 

1.0 

2.0 

3.2 

3.9 

Table  5.2.  FWT  Gauss-Jacobi  Speedups.  Normalized  to  Gauss-Seidel 


Circuit 


Table  5.3.  Comparison  of  FWT  and  PARASITE  on  8  Processors 


Circuit 


Method 


Gauss- 

Seidel 


Gauss- 

Jacobi 


FWT 

Normalized  Speedup 
PARASITE  Difference 

Processor 

Utilization 

1.4 

1.5 

7% 

18% 

2.0 

2.1 

5% 

25% 

4.9 

5.3 

8% 

61% 

2.1 

2.3 

9% 

27% 

3.9 

3.8 

-3% 

48% 

3.1 

3.1 

0% 

55% 

4.0 

4.5 

11% 

63% 

4.5 

4.5 

0% 

80% 

4.0 

4.4 

9% 

79% 

4.8 

5.2 

8% 

86% 

The  processor  utilizations  shown  in  Table  5.3  are  the  unnormalized  FWT  speedups 
divided  by  the  number  of  processors.  Consequently,  these  numbers  do  not  include  the 
time  during  which  processors  are  utilized  to  perform  those  overhead  computations 
which  occur  in  the  8  processor  case  but  not  in  the  1  processor  case,  and  the  time  spent 
by  processors  waiting  for  access  to  shared  resources. 
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5.8  Presimulation  Selection  of  Gauss-Jacobi  or  Gauss-Seidel 


When  confronted  with  a  circuit  to  simulate  and  a  multiprocessor  with  a  given 
number  of  processors  on  which  to  perform  the  simulation,  the  choice  between  the 
Gauss-Seidel  and  Gauss-Jacobi  relaxation  methods  should  take  both  the  circuit  struc¬ 
ture  and  number  of  processors  into  account.  For  the  benchmark  circuits,  the  fastest  of 
the  two  methods  is  indicated  in  Table  5.4.  as  a  function  of  the  number  of  processors, 
based  on  the  performance  results  presented  in  Tables  5.1  and  5.2.  In  the  typical  situa¬ 
tion  for  a  given  circuit.  Gauss-Seidel  is  the  faster  method  when  the  number  of  proces¬ 
sors  is  small.  As  the  number  of  processors  is  increased,  a  break-even  point  pb  is  reached 
such  that  if  more  than  pb  processors  are  used  then  the  Gauss-Jacobi  method  is  faster 
than  Gauss-Seidel.  The  presimulation  estimates  of  Chapter  4.  together  with  rules  of 
thumb  obtained  from  performance  data  of  other  circuits,  can  be  used  to  estimate  pb  and 
therefore  serve  as  a  guide  in  selecting  the  relaxation  method  prior  to  performing  the 
simulation. 

At  the  break-even  point,  the  Gauss-Seidel  method  typically  has  nearly  reached  its 
maximum  possible  speedup,  whereas  the  normalized  Gauss-Jacobi  speedup  is  still 
increasing  nearly  linearly,  with  a  slope  close  to  the  uniprocessor  normalized  speedup. 

Table  5.4.  Fastest  Method  vs.  Number  of  Processors 


Circuit 

1 

Processors 

2  4 

8 

dvs 

S 

S 

J 

J 

dpla 

S 

- 

J 

J 

scdac 

s 

S 

s 

S 

ben2k 

S 

s 

J 

J 

digfi 

S 

s 

s 

J 

Key:  S-Gauss-Seidel.  J -Gauss-Jacobi 

Despite  the  variety  of  types  of  circuits  included  in  the  set  of  benchmark  circuits,  the 
Gauss-Jacobi  single  processor  normalized  speedup  is  consistently  close  to  0.7.  There¬ 
fore.  at  the  break-even  point,  the  normalized  Gauss-Jacobi  speedup  is  approximately 
0.7 pb .  Since  the  Gauss-Seidel  and  Gauss- Jacobi  speedups  are  equal  at  the  break-even 
point,  it  follows  that 


S GS.ett 


(5.2) 


where  ^GS.est  is  an  estimate  of  the  maximum  Gauss-Seidel  speedup  on  an  unlimited 


number  of  processors.  The  maximum  Gauss-Seidel  speedup  can  be  estimated  by  one  of 


the  presimulation  estimation  techniques  of  Chapter  4.  If  the  Type  3  estimate  is  used. 


based  on  the  assumption  that  2  iterations  will  be  used,  then  the  break-even  estimates  of 
Table  5.5  result.  These  estimates  agree  quite  well  with  the  observed  break-even  points 
in  Table  5.4. 


It  should  be  noted  that  the  constant  0.7  used  in  the  break-even  estimate  may  in’ 
general  depend  on  the  partitioning  and  windowing  algorithms.  Any  alterations  to  these 
algorithms  would  necessitate  a  reconsideration  of  the  value  of  the  constant.  It  is  also 
possible  that  certain  types  of  circuits  will  not  agree  with  this  choice  of  the  constant. 
Experience  with  a  larger  number  of  circuits  is  required  to  determine  if  a  single  value  for 


Table  5.5.  Presimulation  Estimate  of  Break-even  Point 


94 


0 


R 


8 


the  constant  always  produces  a  reasonable  estimate,  or  if  different  classes  of  circuits 
each  need  a  different  value  for  the  constant. 
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CHAPTER  6 


TIME  POINT  PIPELINING  IMPLEMENTATION 


The  implementation  of  time  point  pipelining  in  the  TPP  program  is  described  in 
this  chapter,  and  performance  results  are  presented.  The  TPP  program  is  based  on  the 
FWT  program,  with  modifications  to  the  procedures  which  determine  when  a  task  is  eli¬ 
gible  to  exec-%  and  to  the  related  data  structures.  The  important  implementation 
issues  of  time  point  pipelining  are  exposed  by  describing  how  they  were  addressed  in 
the  TPP  program.  The  performance  results  show  that  TPP  produces  faster  run  times 
than  FWT  when  there  are  sufficiently  many  processors  to  make  use  of  the  extra  paral¬ 
lelism  of  time  point  pipelining.  When  too  few  processors  are  used,  the  extra  overhead 
of  TPP  results  in  slower  run  times  than  FWT. 

6.1  Algorithms 

The  coordination  of  parallel  computations  is  significantly  more  complex  in  TPP 
than  FWT.  In  FWT.  each  task  consists  of  the  evaluation  of  a  subcircuit  over  an  entire 
window  on  a  single  iteration,  and  the  task  graph  for  an  iteration  group  is  known  a 
priori.  The  eligibility  of  each  task  for  execution  is  monitored  simply  through  its  unsat 
counter,  which  contains  the  number  of  incoming  arcs  in  the  task  graph  from  tasks 
which  have  not  yet  finished  executing.  In  TPP.  computations  are  coordinated  at  the 
subtask  level,  where  a  subtask  consists  of  the  evaluation  of  a  subcircuit  at  a  single  time 
point  on  a  single  iteration.  The  subtasks  cannot  be  identified  prior  to  beginning  the 
iteration  group,  because  the  variable  time  steps  used  by  the  integration  algorithm 
depend  on  the  waveforms  which  are  computed  during  the  iteration  group.  Conse¬ 
quently.  the  subtask  graph  cannot  be  constructed  a  priori,  and  the  use  of  unsat  counters 


to  monitor  the  eligibility  of  individual  subtasks  for  execution  is  not  feasible.  Instead, 
the  subtasks  to  be  executed  and  the  precedence  constraints  between  subtasks  must  be 
determined  dynamically  during  the  solution  process,  based  on  the  known  full  window 
task  graph  and  based  on  the  actual  time  point  values  selected  by  the  integration  algo¬ 
rithm. 

In  addition  to  the  time  point  subtasks.  TPP  defines  an  initialization  subtask  which 
precedes  the  first  time  point  computation  in  each  subcircuit  evaluation  task,  and  a  con¬ 
vergence  checking  subtask  which  follows  the  last  time  point  computation  in  each  task. 
The  coordination  of  the  execution  of  all  the  subtasks  is  facilitated  by  a  set  of  control 
variables.  Each  of  the  *n  subcircuit  evaluation  tasks  of  an  iteration  group  is  allocated  a 
set  of  control  variables.  { status  .  t  done .  tjiext ,  thready .  waiting_f  or .  lock  }.  The  con¬ 
trol  variables  are  initialized  prior  to  starting  each  iteration  group.  The  meanings  of  the 
variables  are  summarized  below. 

status : 

The  status  variable  has  the  value  UNINITIALIZED  if  the  initialization  subtask 
has  not  been  executed  yet.  Its  value  is  INITIALIZED  if  the  initialization  is  com¬ 
pleted  and  the  last  time  point  of  the  window  has  not  yet  been  computed.  After 
the  last  time  point  is  computed  its  value  is  set  to  INTEGRJX)NE . 

tjdone  : 

The  tjdone  variable  represents  the  time  through  which  the  subcircuit  has  been 
solved  in  the  iteration,  and  it  is  initialized  to  ta  .  the  initial  time  in  the  window. 

tjiext : 

In  most  cases,  tjiext  represents  the  time  of  the  next  time  point  to  be  computed  by 
the  task,  that  is  tjdone  -tjiext  +h  where  h  is  the  step  size  determined  from  the 
local  truncation  error  of  the  integration  method.  More  generally,  tjiext  is  the 


time  such  that  if  all  predecessor  tasks  have  progressed  at  least  to  tjiext .  then  the 
task  is  eligible  to  execute  its  next  subtask.  The  initial  value  of  tjiext  is  ta  +8. 
where  8  is  smaller  than  a  minimum  time  step.  Consequently,  as  soon  as  all  prede¬ 
cessors  of  a  task  compute  at  least  one  time  point,  the  task  becomes  eligible  to  exe¬ 
cute  its  initialization  subtask. 

tjeady : 

The  value  of  tjeady  is  a  time  through  which  ail  input  waveforms  of  the  subcir¬ 
cuit  evaluation  task  are  guaranteed  to  be  valid.  The  tjeady  variable  is  always 
updated  in  synchronization  with  waiting_f  or . 

waiting^/  or : 

If  a  task  is  not  queued  or  running,  then  waiting,  J  or  points  to  one  of  its  predeces¬ 
sors  which  must  advance  before  the  task  becomes  eligible  for  execution.  Other¬ 
wise.  if  a  task  is  queued  or  running,  then  waiting  for  is  NULL  .  indicating  that  it 
is  not  waiting  for  data  from  another  task.  When  a  task  is  not  queued  or  running, 
the  pr.  lessor  indicated  by  waiting_for  is  responsible  for  updating  the  task’s 
tjeady  and  waiting_for  variables,  and  for  queuing  the  task  when  it  is  ready  to 
execute.  Note  that  the  waiting_f  or  predecessor  can  transfer  this  responsibility  to 
another  predecessor  by  modifying  waiting_f  or  . 

lock : 

This  is  a  lock  which  is  used  to  synchronize  updates  of  the  control  variables,  when 
explicit  synchronization  is  necessary. 

Much  of  the  infrastructure  of  FWT  is  used  without  change  in  TPP.  including  all 
the  task  system  algorithms  presented  in  Chapter  5.  The  basic  transient  analysis  algo¬ 
rithm  outlined  in  Algorithm  5.6  is  also  applicable  to  TPP.  with  the  addition  of  the  con¬ 
trol  variable  initializations.  The  FWT  subcircuit  evaluation  task  of  Algorithm  5.7. 


which  embodies  all  the  parallel  computations,  is  modified  for  TPP.  Although  the 
numerical  computations  which  it  performs  are  essentially  unchanged,  the  control  of  the 
computations  is  different.  The  subcircuit  evaluation  task  is  partitioned  into  subtasks 
which  are  performed  sequentially.  After  each  subtask  is  executed,  its  successor  tasks 
are  checked.  Each  successor  which  is  ready  to  execute  at  least  one  subtask  is  queued.  If 
a  task  runs  out  of  input  data  before  all  the  subtasks  are  executed,  then  the  task  ter¬ 
minates  prematurely.  It  is  requeued  later  by  its  waiting_f  or  predecessor  when 
sufficient  input  waveforms  become  available  to  allow  the  computation  of  the  next  sub¬ 
task.  An  outline  of  the  subcircuit  evaluation  task  in  TPP  is  given  below. 

Algorithm  6.1.  TPP  Subcircuit  Evaluation  Task  (4  .  i ) 

/*  initialization*/ 

terminate  —FALSE ;  contjntegration  —FALSE 
if  (statuski— UNINITIALIZED  )  { 
initialize  data  structures 
*  pick  next  time  value.  t_nextk  , 
status k  ,  - INITIALIZED 
LOCKO ockki) 

update  t_readyk  t  and  waiting_f  ort  i 

if  ( waiting_f  ork  ,  &NULL  )  terminate  —TRUE 

UNLOCKUoc**  ■ ) 

} 

/*  integration  */ 

if  ((status  ki  -INITIALIZED  )  and  (terminate  =  FALSE  ))  contjntegration  —TRUE 
while  ( contjntegration  -TRUE  )  { 

compute  time  point  at  time  t  €  (t_donek  , .  t_nextk  ,  ] 
t_donek  j  —t 

pick  next  time  value.  t_nextk  , 

Check JSucces sors  (k,i) 
if  (t_donet  t  Ztb  )  { 

status k  ,  —INTEGR_DONE 
cont  integration  —FALSE 

} 

else  if  ( t_nextk  ,  >t_readyk  , )  { 

LOCK  (lockkl) 
limit  t_nextk  , 

update  t_readyk  l  and  waiting_fork  i 


if  (waiting-/  or ki  &NULL  )  cont -integration  —FALSE 
UNLOCKOoc**  j) 

I 

} 


/*  convergence  checking  */ 
if  (status ki  =INTEGR_DONE  )  check  convergence 

A  successor  check  is  performed  after  each  time  point  is  computed.  In  addition  to  the 

fact  that  successor  checks  are  performed  many  more  times  in  TPP  than  in  FWT.  the 

successor  check  algorithm  is  more  complicated  in  TPP.  because  the  subtask  graph  is  not 


known  a  priori.  The  successor  checking  algorithm  is  summarized  as  follows: 


Algorithm  6.2.  Check.Successors  (k.i) 

if  (less  than  4  time  points  have  been  computed  in  (k.i))  tjref  * - 1 

else  tj-ef  —  (the  time  of  the  third  previous  time  point  in  (k.i )) 
for  each  (successor  (kt . t, )  of  (k.i ))  { 
queueit  —FALSE 
LOCKttodt*  i  ) 

»*  I 

/*  3-step  check  */ 

if  ((waiting_  f  or k  ,  *NULL  )  and  (t_re f  >t_donek  t  )  and 
(t_nextk  ;  >t_donek  .  ))  { 

i'  i 

t_nextk  j  —t_donek  i 
waiting-/ ort  t  —(k.i) 

I 

/*  queuing  check  */ 

if  ((waiting-/ ork  ■  -(k.  i ))  and  (tjxextk  i  < t_donek  ; ))  { 
update  t -ready k  (  and  waiting- f  or k  t 
if  (waiting-/  or k  t  -NULL  )  queueit  —TRUE 

} 

UNLOCK (lockt  /  ) 

if  (queueit  =TRUE )  queue  task  (ks ,  i, ) 

} 


Both  Algorithms  6.1  and  6.2  perform  updates  of  the  t_ready  and  waiting-/ or  control 
variables,  using  the  following  algorithm: 


Algorithm  63.  Update  waiting^/  or k  ,  and  t_readyk  , 

(Jkp  ,  ip  )«-  (a  predecessor  of  (k.  i )  with  minimum  t_donek  t  ) 
t_readyki—t_donek  * 

if  (t_readyk  i  ^tjiextkl )  waiting_forki  *-NULL 
else  waiting_f  or  *-(kp  .  if  ) 

6.1.1  Determining  and  modifying  tjiext 

Except  during  initialization  and  convergence  checking,  tjiext  ki  is  the  tentative 
time  value  of  the  next  time  point  to  be  computed  by  task  (k,  i  ).  The  actual  time  value 
of  the  next  time  point  may  turn  out  to  be  less  than  tjiext  ki .  if.  after  the  initial  attempt 
to  compute  the  time  point,  the  step  size  is  reduced  due  to  excessive  local  truncation 
error  or  excessive  Newton  iterations.  The  final  value  of  tjiext k  {  just  prior  to  the  first 
attempt  to  compute  the  time  point  is  given  by 

tjiext t  i  *  min [tjionek  t  +h,  tb ,  tlimk  }.  (6.1) 

In  the  first  expression,  h  is  the  time  step  selected  by  the  integration  algorithm  in  its 

attempt  to  produce  a  local  truncation  error  which  will  be  approximately  equal  to.  but 
not  greater  than,  the  specified  error  tolerance.  The  tb  bound  is  simply  the  upper  win¬ 
dow  boundary.  The  tlimil  bound  has  the  effect  of  limiting  the  step  size  in  (k,  i )  to  be  no 
greater  than  3  steps  of  any  of  its  predecessors.  The  reason  for  this  limit  is  that  the  local 
truncation  error  estimate  is  based  on  derivatives  of  the  waveforms  computed  from 
divided  differences  of  previous  time  points.  A  sudden  change  in  a  circuit  input 
waveform,  or  a  circuit  nonlinearity,  may  cause  a  sudden  change  in  a  subcircuit  input 
waveform  which  cannot  be  anticipated  by  looking  only  at  the  past  history  of  the 
waveform.  By  forcing  the  subcircuit  to  be  evaluated  at  least  once  for  every  3  input 
steps,  these  sudden  unanticipated  transitions  will  not  be  skipped  over  by  inappropri¬ 
ately  long  steps  based  only  on  estimated  truncation  errors. 


In  uniprocessor  waveform  relaxation  and  in  the  FWT  program.  (6.1)  can  be 
evaluated  as  soon  as  (*.  i )  computes  the  time  point  at  t_donek  i .  because  all  predecessor 
waveforms  are  guaranteed  to  be  available  for  the  entire  window.  However,  in  TPP,  the 
input  waveforms  may  not  be  available  beyond  t_doneki.  Therefore,  following  the 
computation  of  the  time  point  at  tjdonek  { .  an  initial  value  is  computed  for  t_nextk 
based  on  the  information  available  at  that  time.  As  more  input  waveform  points 
become  available  in  the  future,  it  is  necessary  to  reduce  the  value  of  t_nextk  t  if  one  of 
the  inputs  takes  3  steps  in  the  interval  ( t_donek  , ,  t_nextt  . ). 

One  possible  approach  for  handling  this  situation  in  TPP  would  be  to  leave 
t_nextk  i  unchanged  until  all  the  predecessor  waveforms  become  available  through 
tjtextki,  at  which  time  task  (*.  i )  would  become  eligible  to  compute  its  next  time 
point.  Then  all  the  information  would  be  available  to  reduce  t_nextk  t  if  necessary 
before  actually  computing  the  time  point.  This  approach  is  not  used  in  TPP.  because  it 
can  have  a  significant  negative  impact  on  the  time  point  pipelining  parallelism  as 
demonstrated  in  the  following  example. 

Suppose  all  the  subcircuits  start  in  a  dc  steady  state  at  the  beginning  of  a  window. 
The  truncation  error  estimate  will  indicate  that  an  arbitrarily  long  time  step  can  be 
taken,  and  the  tjiext  values  for  the  first  time  point  of  all  tasks  will  be  tb ,  based  on  the 
available  information  at  initialization  time,  except  for  those  subcircuits  connected  to 
external  voltage  sources  which  make  transitions  in  the  window.  The  subcircuits  con¬ 
nected  to  the  voltage  sources  may  compute  many  time  points  before  reaching  the  end  of 
the  window.  However,  if  the  tjiext  values  of  their  successors  are  not  modified,  the 
successors  will  be  prevented  from  computing  concurrent  time  points  during  this  inter¬ 
val  because  they  will  be  waiting  for  their  predecessors  to  reach  t_nextk  ,  -tb . 


This  problem  is  avoided  in  TPP  by  checking  the  tjiext  values  of  all  successors 
after  each  time  point  is  computed.  The  appropriate  updating  of  successors  is  done  in  the 
"3-step  check*  section  of  Algorithm  6.2.  for  those  successors  which  are  not  running  or 
queued  at  the  time  of  the  successor  check.  Tasks  which  are  running  or  queued  are 
responsible  for  maintaining  their  own  tjiext  values. 

6.1.2  Synchronization  of  control  variables 

Some  of  the  control  variables  may  be  accessed  and  updated  by  different  tasks,  and 
therefore  special  precautions  are  required  to  assure  their  integrity.  Parallel  accesses  to 
the  control  variables  of  a  task  are  synchronized  through  the  use  of  the  lock  and 
waiting-/  or  control  variables.  When  waitings f  or  —NULL  .  the  owner  task  is  responsi¬ 
ble  for  maintaining  its  own  control  variables,  and  no  other  task  may  modify  them. 
When  waiting^/  or  ^ NULL .  then  only  tasks  other  than  the  owner  may  modify  the 
control  variables,  but  they  must  do  so  in  a  critical  section  protected  by  the  lock  control 
variable. 

When  the  owner  task  modifies  its  own  control  variables,  it  is  normally  not  neces¬ 
sary  to  acquire  the  lock  first,  because  waiting-/ or  -NULL  and  no  other  task  is  eligible 
to  modify  the  variables.  However,  when  the  owner  task  changes  its  waiting-/ or  value 
to  non-NULL  .  the  lock  must  be  used,  and  all  the  status  variables  must  be  up  to  date  at 
the  moment  that  the  lock  is  released.  When  the  lock  is  released,  the  responsibility  for 
updating  the  task's  control  variables  is  transferred  to  the  task's  predecessors.  In  partic¬ 
ular.  tjready  must  be  up  to  date  in  order  for  the  waiting-/  or  predecessor  to  be  able  to 
properly  queue  the  task  when  sufficient  time  points  are  available.  And  tjiext  must  be 
up  to  date  based  on  all  available  input  data  points,  because  only  those  data  points  com¬ 
puted  by  predecessors  in  the  future  will  affect  tjiext  through  the  successor  check.  The 
use  of  lock  and  waiting-/  or  to  synchronize  accesses  to  the  control  variables  is  shown  in 


Algorithms  6.1  and  6.2. 


6.13  Convergence  checking 

Although  not  shown  explicitly  in  Algorithm  6.1.  complications  arise  in  scheduling 
the  convergence  checking  subtask  in  TPP  because  TPP  uses  the  unaugmented  task 
graphs,  as  opposed  to  the  augmented  task  graphs  used  by  FWT.  The  reason  for  using 
the  unaugmented  task  graphs  is  discussed  in  the  next  section  on  data  structures.  In  this 
section,  the  impact  of  the  unaugmented  task  graph  on  the  convergence  checker  is  con¬ 
sidered. 

The  convergence  checking  subtask  referenced  in  Algorithm  6.1  is  identical  in  con¬ 
tent  to  the  convergence  checking  portion  of  Algorithm  S.7.  However,  extra  precedence 
constraint  checks  must  be  performed  in  the  TPP  program.  Recall  that  the  unaugmented 
task  graph  does  not  include  the  dependency  of  (k.  i )  on  (k  +1.  i  ).  Therefore,  it  is  possi¬ 
ble  that  task  (k  +1,  i )  will  finish  computing  its  last  time  point  before  task  (*.  i )  com¬ 
putes  its  last  time  point.  This  situation  can  arise  in  Gauss-Jacobi  relaxation  even  for 
bidirectionally  coupled  circuits.  When  this  situation  occurs,  task  Gt+l.i)  may  not 
proceed  immediately  to  perform  a  convergence  check  after  computing  its  last  time 
point.  Instead,  it  must  suspend  execution,  to  be  requeued  later  by  task  (i.  i ). 

6.2  Data  Structures 

In  the  FWT  program,  a  subcircuit  may  be  active  in  only  one  iteration  at  a  time, 
due  to  the  use  of  the  augmented  task  graph  T .  In  time  point  pipelining,  a  subcircuit 
may  be  active  in  more  than  one  iteration  at  a  time  even  if  the  augmented  task  graph  is 
used.  In  fact.  Gauss-Jacobi  time  point  pipelining  requires  that  a  subcircuit  be  allowed 
to  be  active  in  different  iterations  simultaneously,  because  this  is  the  only  source  of 
additional  parallelism  which  is  exposed  by  time  point  pipelining  compared  to  the  full 


window  technique.  Gauss-Seidel  time  point  pipelining  also  benefits  from  allowing  a 
subcircuit  to  be  active  in  different  iterations  simultaneously,  even  if  the  circuit  is 
bidirectionally  coupled. 

The  reason  for  using  the  augmented  task  graph  in  FWT  was  to  avoid  the  necessity 
of  duplicating  data  structures  for  different  iterations  of  a  subcircuit.  In  TPP.  the  data 
structures  must  be  duplicated  regardless  of  which  form  of  the  task  graph  is  used. 
Therefore.  TPP  uses  the  unaugmented  form  of  the  task  graph,  which  offers  the  potential 
of  greater  parallelism  than  the  augmented  form.  The  PARASITE  estimates  of  Table  4.8 
indicate  that  the  unaugmented  graph  will  not  result  in  significantly  better  performance 
than  the  augmented  graph  when  only  8  processors  are  used  for  the  benchmark  circuits. 
If  more  processors  are  used,  the  benefits  of  the  unaugmented  graph  are  more  significant. 

In  order  to  allow  different  iterations  of  a  subcircuit  to  be  active  simultaneously, 
the  TPP  program  allocates  separate  copies  of  those  data  structures  which  are  used  and 
modified  during  the  simulation  of  a  subcircuit.  This  results  in  an  increase  in  the 
amount  of  required  memory  space  by  a  factor  of  k  for  this  class  of  data  structures  com¬ 
pared  to  the  FWT  program.  The  contents  of  some  data  structures  are  not  changed  dur¬ 
ing  the  simulation  of  subcircuits  and  these  structures  need  not  be  duplicated.  Conse¬ 
quently.  the  overall  increase  in  required  memory  space  is  less  than  a  factor  of  k. 

Those  data  structures  for  which  each  subcircuit  evaluation  task  is  allocated  a 
separate  copy  include 

(a)  space  for  the  matrix  and  vectors  representing  the  linearized  system  of  equations  on 
each  Newton  iteration; 

(b)  pointers  into  the  matrix  which  indicate  where  terms  computed  from  the  model 
equations  of  each  circuit  element  should  be  added  to  a  matrix  element; 


(c)  pointers  to  successor  and  predecessor  tasks: 

(d)  the  time  values  and  the  node  voltages,  currents,  and  charges  at  the  last  few  time 
points  as  required  by  the  integration  algorithm: 

(e)  the  TPP  control  variables;  and 

(f)  pointers  to  the  waveform  buffers  for  all  internal  nodes  and  source  waveforms 
which  affect  the  subcircuit  on  the  specific  iteration. 

The  subcircuit  element  values,  model  parameter  values,  and  other  invariant  data 
describing  a  subcircuit  are  maintained  in  a  single  copy  for  each  subcircuit. 

The  approach  of  preallocating  separate  data  structures  for  each  task  is  memory 
intensive  and  algorithmically  simple  compared  to  some  other  schemes  which  could  be 
used.  Approaches  which  are  more  conservative  of  memory  may  be  preferable  for  cases 
where  the  virtual  address  space  is  not  large  enough,  or  where  the  real  memory  size  is 
not  large  enough  and  the  resulting  page  misses  encountered  by  the  virtual  memory  sys¬ 
tem  cause  a  degradation  in  performance.  One  of  the  more  conservative  approaches  to 
memory  usage  for  the  subcircuit  matrices  will  be  briefly  outlined,  although  the 
approach  is  not  implemented  in  TPP. 

If  p  processors  are  used,  and  if  p  «nn ,  then  a  significant  reduction  in  memory 
usage  can  be  achieved  by  allocating  space  for  only  one  matrix  per  processor.  The  space 
for  each  matrix  must  be  large  enough  to  accommodate  the  largest  matrix  of  any  subcir¬ 
cuit.  When  a  subtask  executes  on  a  processor,  it  uses  the  matrix  space  associated  with 
that  processor.  This  is  feasible  since  no  data  stored  in  a  matrix  by  a  subtask  are  needed 
by  any  other  subtask:  when  a  subtask  finishes,  the  contents  of  its  matrix  may  be  dis¬ 
carded.  The  problem  with  this  approach  concerns  the  handling  of  pointers  to  matrix 
locations.  When  circuit  elements  are  evaluated,  their  contributions  are  loaded  into  the 
matrix  through  precomputed  pointers.  If  the  destination  matrix  is  not  known  in 
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advance  for  a  given  task,  then  either  p  sets  of  precomputed  matrix  pointers  are  needed, 
one  for  each  possible  destination  matrix,  or  the  pointers  can  be  computed  dynamically 
when  the  terms  are  added  to  the  matrix.  If  p  is  large,  then  the  use  of  p  sets  of  matrix 
pointers  results  is  too  great  of  an  additional  memory  requirement  compared  to  the 
memory  saved  by  eliminating  the  matrices.  The  dynamic  computation  of  matrix 
pointers  can  be  accomplished  by  adding  a  precomputed  offset  to  the  address  of  the 
matrix  origin  when  each  term  is  added  to  the  matrix.  In  this  case,  each  matrix  load 
operation  requires  one  extra  address  addition. 

Matrix  pointers  are  also  used  to  indicate  the  structure  of  matrices  stored  in  sparse 
form.  In  FWT  and  TPP.  these  pointers  are  computed  once  in  the  presimulation  phase  of 
the  program.  The  pointers  are  used  repeatedly  during  the  solutions  of  the  sparse  sys¬ 
tems  of  equations.  In  the  scheme  where  only  one  matrix  is  allocated  per  processor,  the 
pointers  representing  the  matrix  structure  could  be  determined  once  and  stored  in  a 
master  copy  of  the  matrix  for  each  subcircuit.  Then  when  a  task  needs  to  use  a  matrix, 
it  could  copy  its  matrix  structure  pointers  while  adding  an  offset  to  account  for  the 
location  of  the  specific  matrix  space  to  be  used  in  memory.  This  results  in  the  added 
requirement  of  one  copy  of  the  matrix  structure  for  each  subcircuit,  plus  a  computa¬ 
tional  cost  in  initializing  the  matrix  each  time  it  is  assigned  to  a  particular  processor.  In 
the  TPP  program,  these  extra  computations  are  avoided  by  using  the  more  memory 
intensive  approach  of  preallocating  one  matrix  for  each  subcircuit  evaluation  task.  The 
matrix  pointers  for  each  copy  of  the  matrix  are  computed  once  in  the  presimulation 
phase  of  the  program,  and  they  are  used  repeatedly  in  each  iteration  group. 


(J  Results 


The  performance  of  the  TPP  program  was  measured  for  the  benchmark  circuits, 
and  the  results  are  presented  in  Table  6.1  for  the  Gauss-Seidel  method  and  Table  6.2  for 
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the  Gauss-Jacobi  method.  The  results  on  8  processors  are  compared  with  the 
PARASITE  estimates  in  Table  6.3.  The  TPP  results  are  within  21%  of  the  ideal 
PARASITE  estimates,  reflecting  the  higher  degree  of  overhead  compared  to  the  FWT 
results  which  are  within  1 1%  of  the  ideal  speedups.  The  comments  in  Chapter  5  con¬ 
cerning  the  factors  which  influence  the  comparison  with  the  PARASITE  estimates  are 

Table  6.1.  TPP  Gauss-Seidel  Speedups 


Table  6.2.  TPP  Gauss-Jacobi  Speedups.  Normalized  to  Gauss-Seidel 
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applicable  here  as  well. 

U  Presimulation  Selection  of  TPP  or  FWT 

The  fastest  method  as  a  function  of  the  number  of  processors  is  given  in  Table  6.4. 
based  on  the  results  in  Tables  5.1,  5.2,  6,1.  and  6.2.  In  Chapter  5  it  was  shown  that  the 
presimulation  speedup  estimates  can  be  used  to  select  between  the  Gauss-Seidel  and 
Gauss-Jacobi  methods  when  the  full  window  technique  is  used.  Time  point  pipelining 
has  at  least  as  much  available  parallelism  as  the  full  window  technique,  and  any 
presimulation  selection  technique  that  does  not  take  overhead  factors  into  consideration 
will  always  favor  time  point  pipelining  over  the  full  window  technique.  However,  it  is 
apparent  from  Table  6.4  that  the  full  window  technique  is  sometimes  faster  than  time 
point  pipelining  because  of  its  lower  overhead.  For  best  performance,  time  point  pipe¬ 
lining  should  only  be  used  in  those  cases  where  the  full  window  technique  does  not 
have  sufficient  parallelism  to  keep  the  processors  very  busy.  This  suggests  that  esti¬ 
mates  of  processor  utilization  for  the  full  window  technique  may  be  useful  in  selecting 
between  the  full  window  technique  and  time  point  pipelining. 


Table  6.4.  Fastest  Method  vs.  Number  of  Processors 


Circuit 
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A  procedure  is  now  presented  for  selecting  the  best  of  the  four  methods  prior  to 
the  simulation  of  a  given  circuit  on  a  given  number  of  processors.  First,  the  choice 
between  the  Gauss-Seidel  and  Gauss-Jacobi  relaxation  methods  is  made  based  on  the 
estimate  of  pb  as  described  in  Chapter  5.  Next,  the  potential  processor  utilization  for 
the  full  window  technique  is  computed  by  dividing  the  speedup  estimate  by  the  number 
of  processors.  If  the  potential  utilization  is  larger  than  some  threshold  ub  .  then  the  full 
window  technique  is  used:  otherwise  time  point  pipelining  is  used.  The  procedure  is 
specified  in  detail  in  the  following  algorithm.  The  circuit  is  assumed  to  be  given,  p  is 
the  number  of  processors  to  be  used,  and  ub  is  a  predetermined  constant. 

Algorithm  6.4.  Presimulation  Selection 

if  (p=l){ 

rjnethod  —GS 
p  method  *~FWT 

i 

else  { 

Compute  ^OS.ta  .  the  Type  3  speedup  estimate  for  Gauss-Seidel  using 
the  full  window  technique,  assuming  a  2-iteration  augmented 
task  graph  and  unlimited  processors. 

Pb.ttl  *~^GS.en  ^-7 

if  (p  <  pb  t„  )  rjnethod  —GS 

else  rjnethod  — GJ 

Compute  Sr  «„  .  the  Type  3  speedup  estimate  for  the  full  window 
technique  using  the  relaxation  method  specified  by  rjnethod  . 
assuming  a  2-iteration  augmented  task  graph  and  unlimited 
processors. 

U  *~^r  muhod  ,ea  IP 

if  (u^ub)  pjnethod  —FWT 
else  p  method  —TPP 

) 

At  the  conclusion  of  the  algorithm,  the  relaxation  method  is  given  by  rjnethod .  and  the 
choice  between  the  full  window  technique  and  time  point  pipelining  is  given  by 


pjnethod . 
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The  potential  utilization  represented  by  u  is  the  fraction  of  time  that  the  proces¬ 
sors  would  have  to  be  busy  to  achieve  speedup  Srmtt/)0t/  on  p  processors,  where 
Sr  mahod  ta  **  the  speedup  of  the  full  window  technique  assuming  an  unlimited  supply 
of  processors.  If  u  <1.  then  the  processors  will  only  be  active  part  of  the  time.  In  this 
case,  the  use  of  time  point  pipelining  is  appropriate  to  increase  the  processor  utilization 
and  the  speedup.  If  u  is  slightly  greater  than  1.  then  the  processors  will  be  busy  all  the 
time  if  the  computational  load  is  perfectly  balanced  between  the  different  processors. 
However,  the  nonuniformities  in  task  sizes  and  precedence  constraint  patterns  result  in 
imperfect  load  balancing.  Consequently,  there  will  still  be  certain  times  when  the  pro¬ 
cessors  run  out  of  available  tasks.  Therefore,  time  point  pipelining  should  be  beneficial 
in  this  case  also.  If  u  » 1  then  the  full  window  technique  generates  much  more  con¬ 
current  work  than  the  processors  can  handle,  and  consequently  time  point  pipelining 
should  not  be  used  because  it  will  just  add  more  overhead.  The  break-even  value  of  u  , 
where  the  full  window  technique  and  time  point  pipelining  have  about  the  same  perfor¬ 
mance  is  given  by  ub .  The  value  of  ub  is  expected  to  be  greater  than,  but  not  much 
greater  than  1. 

In  order  to  determine  if  there  exists  a  value  of  ub  which  would  result  in  correct 
selections  of  the  fastest  algorithms,  the  values  of  u  have  been  tabulated  in  Table  6.5. 
for  the  indicated  relaxation  methods.  The  value  u6=1.7  is  found  to  produce  good 
results.  With  this  choice  of  ub.  the  presimulation  selection  algorithm  chooses  the 
methods  shown  in  Table  6.6.  In  all  cases  except  for  one.  the  fastest  method  (or  one 
with  nearly  the  same  performance)  is  chosen.  In  the  one  exceptional  case,  the  chosen 
method  is  8%  slower  than  the  fastest  method. 

These  results  demonstrate  that  a  simple  analysis  of  the  task  graphs  can  be  used  to 
get  some  indication  of  the  relative  performance  of  different  methods  prior  to 


Table  6.5.  Presimulation  FWT  Potential  Utilization  Estimates 


Circuit 

Processors 

1  2  4  S 

dvs 

dpla 

scdac 

ben2k 

digfi 

1.5  (S)  0.8  (S)  1.6  (J)  0.8  (J) 

2.0  (S)  1.0  (S)  1.6  (J)  0.8  (J) 

7.0  (S)  3.5  (S)  1.8  (S)  0.9  (S) 

2.3  (S)  1.2  (S)  1.8  (J)  0.9  (J) 

5.0  (S)  2.5  (S)  1.3  (S)  4.7  (J) 

Key:  S-Gauss-Seidel.  J-Gauss-Jacobi 

Table  6.6.  Presimulation  Prediction  of  Fastest  Metbod 
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2  4 
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TS 
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scdac 
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FS 
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Tj* 

digfi 

FS 

FS 

TS 

FJ 

Key:  S-Gauss-Seidel.  J-Gauss-Jacobi. 
•-Disagrees  with  Table  6.4 

simulation.  Details  of  the  presimulation  selection  algorithm  may  have  to  be  modified  if 

significant  changes  are  made  in  the  implementation  details  of  the  simulation  algorithms  j 

i 

I 

or  in  the  types  of  circuits  being  simulated.  In  particular,  the  Gauss-Jacobi/Gauss-Seidel  ! 

l 

I 

ratio  of  0.7  used  in  the  pb  estimate,  and  the  ub  value  of  1.7  are  empirically  determined  ( 

constants  obtained  based  on  the  implementations  in  the  FWT  and  TPP  programs  as  j 

applied  to  the  benchmark  circuits.  For  the  benchmark  circuits,  which  represent  a  < 

variety  of  types  of  MOS  digital  circuits,  the  presimulation  selection  method  produces 
good  results  using  the  previously  mentioned  constants. 
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CHAPTER  7 


a 


LATENCY 


The  number  of  computations  required  to  solve  a  set  of  circuit  equations  over  a 
time  interval  can  be  reduced  by  exploiting  latency.  Latency  exploitation  has  been  used 
in  a  variety  of  forms  in  several  circuit  simulators  [Nag75,  Yan80.  Whi86,  Sal87a, 
SakSl.  Cox87.  Rab76].  Latency  is  exploited  by  recognizing  situations  in  which  a  new 
solution  point  will  match  some  previous  solution  point.  If  this  determination  can  be 
done  cheaply  in  advance  of  actually  solving  for  the  new  solution  point,  then  the  com¬ 
putations  to  solve  for  the  new  solution  point  can  be  avoided.  Two  types  of  latency  are 
addressed  in  RELAX2.3:  time  latency  and  iteration  latency.  This  chapter  considers  the 
use  of  these  types  of  latency  in  the  parallel  waveform  relaxation  algorithms. 

7.1  Time  Latency 

A  subcircuit  is  said  to  be  time  latent  in  time  interval  [f  l.t2]  if  the  inputs  to  the 
subcircuit  are  constant  in  the  interval  and  the  subcircuit  is  in  a  dc  steady  state  at  time 
t  ,.  In  this  situation  it  is  not  necessary  to  compute  the  voltages  for  any  time  point  in  the 
interval  ( t  2]  since  the  values  will  be  identical  to  those  at  t  v  The  s-bcircuit  is  deter¬ 
mined  to  be  in  a  dc  steady  state  at  t  x  by  observing  that  the  subcircuit's  internal  voltages 
and  charges  are  constant  and  the  currents  flowing  into  capacitors  are  zero,  within  some 
tolerance.  Once  this  determination  is  made,  the  end  of  the  time  latency  interval  can  be 
found  by  searching  for  the  time  at  which  the  first  change  occurs  in  an  input  waveform 
of  the  subcircuil. 
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Even  without  explicitly  handling  time  latency,  the  basic  waveform  relaxation 
algorithm  automatically  exploits  time  latency  and  slowly  changing  signals  by  choosing 
the  time  steps  independently  for  each  subcircuit.  Therefore,  the  advantage  of  explicitly 
managing  time  latency  is  not  as  great  for  waveform  relaxation  compared  to  other  solu¬ 
tion  techniques  which  do  not  use  multirate  integration.  The  time  latency  algorithm 
may  be  more  efficient  in  placing  the  next  time  point  exactly  at  the  point  in  time  where 
the  first  input  change  occurs,  whereas  the  time  step  computed  based  on  the  truncation 
error  does  not  look  ahead  to  future  input  changes,  except  indirectly  through  the  3-step 
check  (see  Algorithm  6.2). 

A  danger  of  time  latency  exploitation  is  that  false  latency  detection  can  occur  if 
the  tolerances  are  too  large.  This  is  especially  important  in  linear  circuits,  where  decay¬ 
ing  exponential  waveforms  satisfy  the  latency  criteria  before  actually  reaching  their 
final  values.  In  digital  circuits,  the  likelihood  of  false  latency  detection  is  reduced 
because  of  the  sharp  nonlinearities  which  clamp  signals  to  steady  state  values  relatively 
quickly  after  transitions. 

7.2  Iteration  Latency 

Subcircuit  i  is  said  to  be  iteration  latent  on  iteration  k  if  all  the  input  waveforms 
of  subcircuit  evaluation  task  (.k.i )  are  identical  to  the  inputs  of  task  (k—  l.i ).  within 
some  tolerance.  In  this  situation  it  is  not  necessary  to  solve  the  subcircuit  on  iteration 
k  since  the  solution  will  be  almost  identical  to  the  solution  of  iteration  k  —  1.  In  previ¬ 
ous  work,  iteration  latency  has  been  called  partial  waveform  convergence.  This  is 
because  iteration  latency  most  commonly  occurs  when  some  of  the  subcircuits  converge 
to  the  solution  while  the  rest  of  the  circuit  requires  additional  iterations.  After  the  first 
subcircuits  converge,  they  become  iteration  latent  for  the  remaining  iterations  while  the 
other  subcircuits  continue  to  be  evaluated. 


In  the  Gauss-Jacobi  method,  a  signal  change  originating  in  one  subcircuit  may 
require  some  number  of  iterations  j  to  reach  subcircuit  i .  because  signals  propagate 
through  the  subcircuit  graph  at  the  rate  of  one  subcircuit  per  iteration.  If  subcircuit  i 
is  in  a  dc  steady  state  at  the  beginning  of  the  window,  it  will  be  iteration  latent  for 
iterations  1  through  y— 1.  In  this  scenario,  the  terminology  partial  waveform,  conver¬ 
gence  is  inappropriate,  because  the  latency  is  not  due  to  convergence.  The  terminology 
iteration  latency  is  more  basic,  and  covers  both  of  the  situations  described  above. 

Iteration  latency  can  occur  over  an  entire  window,  or  over  the  first  part  of  a  win¬ 
dow.  Let  lk  and  Ik  be  the  sets  of  input  waveforms  to  the  subcircuit  when  the  subcir¬ 
cuit  is  solved  on  iterations  k  —1  and  k  .  respectively.  If  the  window  is  [ta.  tb  ].  then  the 
subcircuit  can  be  iteration  latent  in  subinterval  [ta.  tx  ].  for  some  tx  ^tb .  If  tx  <tb  .  then 
tx  is  the  last  time  point  in  the  window  for  which  the  waveforms  of  lk  match  those  of 
A  -i  within  the  specified  tolerance.  The  subcircuit  solution  of  the  previous  iteration  can 
be  used  in  the  interval  [ta.tx].  whereas  solution  points  in  the  interval  Ux.tb  ]  must  be 
recomputed  on  iteration  k.  The  final  waveforms  for  the  window  are  equivalent  to 
those  that  would  have  been  computed  if  the  Ik  M  input  waveforms  were  applied  to  the 
subcircuit  through  time  tx .  and  the  Ik  input  waveforms  were  applied  after  tx .  The 
discontinuity  of  the  effective  input  waveforms  at  tx  can  cause  problems  if  the  tolerance 
used  to  compare  the  input  waveforms  is  too  large.  Artificial  glitches  may  occur  in  the 
output  waveforms  and  excessive  time  points  may  be  computed  after  the  discontinuity 
due  to  the  glitches.  In  extreme  cases,  one  of  these  discontinuities  can  lead  to  a  failure  of 
the  integration  algorithm,  as  the  step  size  control  mechanism  repeatedly  reduces  the  step 
size  in  a  vain  attempt  to  reach  an  acceptable  truncation  error  computed  from  divided 
difference  estimates  of  derivatives  of  discontinuous  functions. 


RELAX2.3  uses  iteration  latency  when  the  subcircuit  is  latent  over  the  entire  win¬ 
dow.  but  not  when  the  subcircuit  is  latent  over  only  the  first  part  of  the  window.  The 
likelihood  of  anomalous  behavior  due  to  input  waveform  discontinuities  is  reduced  in 
this  case,  since  the  number  of  iteration  latency  boundaries  is  greatly  reduced,  and  a  sub¬ 
circuit  is  more  likely  to  be  in  a  dc  steady  state  at  a  window  boundary,  although  many 
window  boundaries  occur  during  signal  transitions  in  some  of  the  benchmark  circuits. 
Consequently,  the  policy  of  using  iteration  latency  only  when  the  subcircuit  is  latent 
over  the  entire  window  reduces  the  likelihood  of  latency-induced  errors,  but  does  not 
assure  that  such  errors  will  not  arise. 

Latency-induced  errors  can  be  made  as  small  as  desired  by  decreasing  the  latency 
tolerances.  However,  very  small  tolerances  lead  to  very  little  latency  exploitation.  In 
effect,  the  use  of  iteration  latency  and  the  setting  of  the  tolerances  present  a  tradeoff  of 
simulation  speed  versus  reliability  and  accuracy.  The  latency  algorithms  of  RELAX2.3 
produce  acceptably  accurate  waveforms  for  all  the  benchmark  circuits,  using  iteration 
latency  tolerances  which  are  identical  to  the  waveform  convergence  tolerances. 

73  Impact  of  Latency  on  Speedup 

Latency  exploitation  reduces  the  number  of  computations  which  must  be  per¬ 
formed.  and  therefore  reduces  the  uniprocessor  run  time.  rv  If  some  of  the  eliminated 
computations  are  in  the  critical  path  of  the  task  graph,  then  the  parallel  run  time  will 
also  be  reduced,  provided  the  overhead  of  latency  management  is  not  too  large.  The 
parallel  completion  time  of  the  program  on  an  unlimited  number  of  processors  without 
overhead  is  given  by 

^  (7.1) 

*€  Ituki  in  P I 

where  wx  is  the  CPU  time  of  task  x  .  and  P  is  a  path  in  the  task  graph  which  maximizes 
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the  expression  in  (7.1).  If  the  overhead  for  detecting  latency  is  negligible,  then  wx  for 
each  task  x  will  be  reduced  or  remain  the  same  when  latency  is  exploited.  Conse¬ 
quently.  latency  exploitation  will  reduce  the  overall  parallel  run  time,  or  at  worst  leave 
it  unchanged,  neglecting  the  overhead  required  to  manage  the  latency. 

The  speedup  on  unlimited  processors  neglecting  overhead  is  £(/.<x>sri/Too-  Since 
both  rt  and  are  reduced  when  latency  is  exploited,  the  speedup  may  increase  or 
decrease.  If  the  tasks  in  P  exhibit  more  latency  than  other  tasks,  then  the  speedup  will 
tend  to  increase  when  latency  is  exploited.  Otherwise,  the  speedup  will  tend  to  be 
reduced.  For  circuits  that  partition  into  subcircuits  of  nonuniform  size,  there  is  a  ten¬ 
dency  for  P  to  consist  of  subcircuit  evaluation  tasks  of  the  larger  subcircuits.  This  is 
especially  true  for  Gauss-Jacobi.  where  P  typically  contains  instances  of  1  or  2  of  the 
largest  subcircuits.  Subcircuit  evaluation  tasks  for  large  subcircuits  typically  have 
many  input  waveforms  from  other  subcircuits.  The  likelihood  that  all  of  the  inputs 
will  match  the  previous  iteration,  or  that  all  the  inputs  will  be  constant  in  some  time 
interval  is  small  by  comparison  with  smaller  subcircuits.  This  intuitive  argument  sug¬ 
gests  that  latency  exploitation  will  tend  to  reduce  speedup,  when  the  number  of  proces¬ 
sors  is  large,  and  when  the  subcircuit  partitioning  is  nonuniform. 

As  a  counter-example,  consider  a  circuit  which  during  a  given  time  window  con¬ 
sists  of  two  essentially  independent  parts,  one  part  consisting  of  a  few  small  tightly 
coupled  subcircuits  and  the  other  consisting  of  large  loosely  coupled  subcircuits.  Since 
the  small  subcircuits  are  tightly  coupled,  they  will  required  more  iterations  than  the 
large  subcircuits,  and  the  large  subcircuits  will  become  iteration  latent  after  a  few 
iterations.  In  this  case,  the  large  subcircuits  in  the  critical  path  will  exhibit  greater 
latency  than  the  average  subcircuit,  and  the  speedup  may  be  increased  by  latency 
exploitation.  Although  this  situation  is  less  likely  than  the  previous  example,  it  demon- 
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struts  that  latency  exploitation  does  not  necessarily  reduce  the  amount  of  available 
parallelism. 

The  effectiveness  of  time  point  pipelining  is  also  affected  by  latency  exploitation. 
Suppose  P  is  a  path  in  the  full  window  task  graph  consisting  of  relatively  large  subcir¬ 
cuit  evaluation  tasks  x  l . x4.  and  that  time  points  occur  at  times  t , .  t  s  in  each 

task.  The  first  task  cannot  be  iteration  latent,  since  there  is  no  previous  iteration.  Sub¬ 
sequent  tasks  in  the  path  will  tend  to  be  iteration  latent  over  increasing  portions  of  the 
first  part  of  the  window.  For  example,  suppose  x2  is  nonlatent.  x3  is  iteration  latent 
through  1 2.  and  x4  is  iteration  latent  through  t4.  If  the  latency  is  not  exploited,  if  4 
processors  are  available,  and  if  only  the  tasks  in  path  P  are  considered,  then  the  parallel 
execution  of  time  points  will  proceed  as  shown  in  Table  7.1.  and  will  require  S  units  of 
processor  time,  compared  to  20  units  of  time  on  a  single  processor.  If  partial-window 
latency  is  exploited,  the  parallel  execution  will  proceed  as  shown  in  Table  7.2.  and  will 
require  the  same  amount  of  processor  time.  8  units.  However,  the  uniprocessor  time  is 
reduced  to  14  by  exploiting  the  latency.  Consequently,  in  this  example,  iteration 
latency  exploitation  reduces  the  uniprocessor  run  time  but  leaves  the  4-processor  run 


Table  7.2.  Parallel  Schedule  Example:  Latency 


Time 

x , 

*2 

XA 

1 

r, 

2 

t. 

r, 

3 

t. 

ty 

4 

td 

tl 

5 

t, 

tA 

'j 

6 

ts 

*4 

7 

h 

8 

time  unchanged. 

The  parallel  execution  time  will  be  reduced  by  latency  if  one  or  more  of  the  tasks 
in  P  are  iteration  latent  over  the  entire  window.  For  example,  if  x  j  is  iteration  latent 
over  the  entire  window,  then  x4  may  compute  each  nonlatent  time  point  one  step  ear¬ 
lier.  As  noted  for  the  full  window  technique,  the  critical  paths  in  the  task  graph  tend 
to  contain  large  tasks  which  are  less  likely  to  be  latent  over  the  entire  window  than 
other  tasks.  Therefore,  iteration  latency  exploitation  typically  reduces  the  amount  of 
parallelism  in  time  point  pipelining. 

7.4  FWT  Latency  Implementation 

The  window-level  exploitation  of  iteration  latency  in  RELAX2.3  is  readily  imple¬ 
mented  in  the  FWT  program,  because  the  entire  input  waveforms  are  available  when  a 
subcircuit  evaluation  task  is  started.  Since  FWT  uses  the  augmented  task  graph  T .  the 
entire  input  waveforms  of  task  (*  — l.i  )  as  well  as  the  inputs  of  (*.  i  )  are  guaranteed 
to  be  available  when  task  (k,i )  begins  executing.  Consequently,  iteration  latency  can 
be  checked  for  the  entire  window  before  computing  the  first  time  point  of  the  subcir¬ 
cuit.  and  if  latency  is  detected,  no  time  points  are  computed.  If  the  subcircuit  is  not 


iteration  latent,  time  latency  is  checked  before  computing  each  time  point.  When  time 
latency  is  detected,  the  input  waveforms  may  be  searched  immediately  for  the  next 
change  in  value,  which  determines  the  next  time  point  at  which  the  subcircuit  must  be 
evaluated. 

The  implementation  of  latency  in  FWT  affects  only  the  subcircuit  evaluation  task 
algorithm,  which  is  given  below  in  Algorithm  7.1.  The  current  window  boundaries  are 
represented  by  ta  and  tb .  the  iteration  number  is  k  .  and  the  subcircuit  number  is  i . 
The  convergence  and  successor  checks  are  the  same  as  in  Algorithm  5.7. 

Algorithm  7.1.  FWT  Subcircuit  Evaluation  Task  with  Latency  U,  i ) 

determine  time  t  £tb  such  that  the  inputs  to  (.k.i  )  match  the  previous 
iteration  on  [r4 .  t  ] 

if  (t  =r4  )  Vj(*  \t  ~1}(t ).  for  all  t  €[«,.  tt  ] 
else  { 

tjione  —tM 

pick  next  time  point  tjiext 
while  ( tjione  ktb  )  { 

if  ((k.  i )  is  in  dc  steady  state  at  time  tjdont )  { 

determine  time  l  %tb  such  that  the  inputs  to  (k.  i ) 
are  constant  on  [tjione,  t  ] 
v,(i  )*-v. (*  Xtjiont  ) 

} 

else  solve  subcircuit  at  time  t  €(t~don*.t_next  ] 
tjione  —t 

pick  next  time  point  t  next 

} 

I 

check  successors 
check  convergence 

7.5  TPPL  Latency  Implementation 

Time  point  pipelining  presents  unique  problems  for  the  exploitation  of  time 
latency  and  iteration  latency.  Problems  arise  because  only  partial  waveforms  are  gen¬ 
erally  available  when  time  points  are  computed,  and  because  long  latent  intervals  can 


cause  pipelining  bottlenecks. 

When  tine  latency  is  detected  at  time  t .  it  may  not  be  possible  to  immediately 
determine  the  next  time  for  which  the  task  should  be  evaluated,  because  the  input 
waveforms  may  not  be  available  through  the  time  of  the  next  input  change.  Conse¬ 
quently.  it  may  be  necessary  to  mark  the  task  as  being  time  latent  and  suspend  its  exe¬ 
cution.  A  mechanism  must  then  be  provided  for  the  task  to  be  restarted  when  one  of 
its  inputs  changes  value.  Note  that  this  condition  for  restarting  a  time  latent  task  is 
different  from  the  condition  for  restarting  a  nonlatent  task.  In  a  nonlatent  task,  the 
time  of  the  next  time  point,  tjtext  is  determined  before  the  task  is  suspended,  and  the 
predecessors  restart  the  task  after  they  have  all  progressed  through  time  tjwxt .  regard¬ 
less  of  whether  or  not  their  waveforms  change  value. 

Bottlenecks  can  occur  if  time  advancements  are  not  propagated  through  time 
latent  tasks.  For  example,  consider  a  task  x  with  a  set  of  predecessors  P  and  a  set  of 
successors  5 .  Suppose  that  x  is  time  latent  on  the  interval  If  t  J.  and  that  the  tasks  in 
P  and  S  will  all  be  evaluated  at  times  r  j.  f  2.  •  •  •  ,tm  This  is  not  an  unreason¬ 

able  situation,  because  tasks  in  5  may  also  be  immediate  successors  of  tasks  in  P .  and 
the  waveforms  fed  from  P  to  S  may  be  active  even  though  those  waveforms  fed  from 
P  to  x  may  be  unchanging.  Task  x  does  not  compute  any  time  points  in  the  interval 
(fv.tj  because  it  is  time  latent.  However,  if  the  output  waveforms  of  x  are  not 
updated  until  the  time  point  is  reached  at  the  end  of  the  latent  interval,  at  t m.  then  the 
tasks  in  S  will  not  be  able  to  compute  the  time  point  at  t  j  until  after  the  tasks  in  P 
have  progressed  all  the  way  to  time  tM.  This  bottleneck  can  be  avoided.  If  all  tasks  in 
P  have  progressed  to  time  t,  and  x  is  still  time  latent,  then  the  waveforms  of  x  can  be 
extended  to  time  1 1  without  computing  a  new  solution  point,  and  the  successors  of  x 
can  then  be  allowed  to  compute  new  solutions  at  time  t In  terms  of  the  TPP  control 


variables,  the  bottleneck  is  avoided  by  having  tjione  of  the  latent  task  track  the  value 
of  tjrtady ,  and  by  periodically  updating  t_ready  to  track  the  progress  made  by  the 
predecessor  tasks  during  the  latent  interval. 

If  time  latency  is  not  handled  explicitly,  and  the  time  steps  are  determined 
exclusively  based  on  local  truncation  error,  then  time  advancements  cannot  be  pro¬ 
pagated  through  latent  tasks  as  described  above.  Therefore,  the  explicit  detection  of 
time  latency  has  a  beneficial  influence  on  parallelism. 

The  use  of  iteration  latency  in  time  point  pipelining  presents  a  more  fundamental 
problem  than  time  latency.  The  window-level  approach  to  iteration  latency  used  in 
FWT  is  not  particularly  suitable  for  time  point  pipelining.  Suppose  that  the  window- 
level  latency  approach  were  used  for  task  x  with  predecessors  P  and  successors  S .  The 
determination  that  x  is  iteration  latent  cannot  be  made  until  all  of  x's  input 
waveforms  are  available  over  the  entire  window.  In  this  case  the  output  waveforms  of 
x  are  not  known  until  all  the  tasks  in  P  reach  the  end  of  the  window,  which  means 
that  all  the  tasks  in  5  will  be  delayed  until  the  tasks  in  P  are  finished.  If  latency  were 
not  exploited,  the  time  points  would  propagate  through  x  one  at  a  lime,  allowing  tasks 
in  S  to  be  executing  concurrently  with  tasks  in  P .  Thus,  window-level  latency  exploi¬ 
tation  has  a  serious  negative  impact  on  the  pipelining  parallelism. 

Similar  bottlenecks  appear  in  the  window-level  latency  approach  even  when  tasks 
are  not  iteration  latent  over  the  entire  window.  If  task  x  is  iteration  latent  on  [ta.t], 
for  some  t  <tA .  then  the  determination  that  x  is  not  latent  on  the  window  cannot  be 
made  until  all  the  tasks  in  P  reach  time  t .  Only  then  can  x  compute  its  first  time 
point.  If  t  »ta.  the  delay  in  starting  to  solve  task  x  will  result  in  a  significant 
decrease  in  parallelism.  This  parallelism  reduction  will  be  comparatively  small  in  early 
iterations  since  the  iteration  latency  boundaries  will  be  close  to  tg  .  but  the  parallelism 
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reduction  will  be  comparatively  large  in  later  iterations  as  the  iteration  latency  boun¬ 
daries  approach  tb . 

By  allowing  iteration  latency  to  be  exploited  on  partial  windows,  the  time  point 
pipelining  bottlenecks  described  above  can  be  avoided.  When  partial  window  latency  is 
used,  the  latency  status  of  a  task  can  be  determined  time  point  by  time  point  as  the 
input  waveforms  advance.  For  example,  if  the  input  waveforms  to  task  x  are  available 
through  time  t .  and  they  match  the  previous  iteration  through  time  t .  then  the  output 
waveforms  of  task  x  for  the  current  iteration  can  be  copied  from  the  previous  iteration 
through  time  t  and  these  waveforms  can  be  made  available  to  S .  If  at  time  tx  >t  one 
of  the  inputs  to  x  differs  from  the  previous  iteration,  then  x  becomes  nonlatent  at  time 
tx  and  begins  computing  solution  points.  Just  as  in  the  time  latency  case,  it  is  impor¬ 
tant  to  propagate  time  advancements  through  iteration  latent  tasks  to  avoid  bottlenecks 
on  long  latent  intervals. 

The  detection  of  iteration  latency  requires  comparing  the  current  input  waveforms 
with  the  input  waveforms  of  the  previous  iteration.  Therefore,  when  checking  iteration 
latency  at  time  t ,  it  is  necessary  that  the  input  waveforms  of  the  previous  iteration  be 
available  through  time  t .  In  order  to  guarantee  that  the  previous  iteration  waveforms 
will  be  available,  the  augmented  task  graph  T  must  be  used,  rather  than  the  unaug¬ 
mented  task  graph  used  in  the  TPP  program  without  latency.  The  PARASITE  results  of 
Chapter  4  indicate  that  the  penalty  for  using  T  is  small  except  when  the  number  of 
processors  is  very  large. 

Time  latency  and  iteration  latency  have  been  implemented  in  an  experimental  ver¬ 
sion  of  the  TPP  program  which  bears  the  unimaginative  but  functional  name  TPPL. 
The  algorithm  for  the  subcircuit  evaluation  task  is  outlined  below  in  Algorithms  7.2  - 
7.6.  The  TPP  control  variables  retain  their  meanings  as  defined  in  Chapter  6,  except  the 
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status  variable  can  take  on  additional  values  to  indicate  when  a  task  is  iteration  latent 


or  time  latent.  When  the  TPP  control  variables  appear  without  subscripts,  the  sub¬ 
scripts  k.  i  are  assumed.  The  value  of  S  is  smaller  than  the  smallest  possible  time  step. 

Algorithm  7.2.  TPPL  Subcircuit  Evaluation  Taak  (k.i) 

more  —TRUE 
while  (more  *TRUE )  { 

if  ( status  ^UNINITIALIZED  )  tppljnit  (k.  i ) 

else  if  ( status  * EVALUATE  )  tppljevtd  (*.  * ) 

else  if  (status  =  ITER_LA  TENT  )  iter  Jot  ( k.i ) 

else  if  (status  -T1MEJLATENT  )  time  lot  (k.i) 

else  if  (status  ~ENDJTER_LA TENT )  endjterjal  (k.  i ) 

else  if  (status  =END_T I ME_LA  T ENT  )  endjimejat  (k.  i  ) 

else  if  (status  * CONVERGE.CHECK )  { 

if  (global  convergence  achieved  and  tb<tf) 
start  next  window 

more  —FALSE 

) 

if  (more  =TRUE)\ 

LOCKOoc*) 

if  (status  * EVALUATE )  limit  tjxext 
update  t_ready  and  waiting-/  or 
if  (waiting_f  or  re  NULL  )  more  —FALSE 
UNLOCKOoc*  ) 

I 


Algorithm  7.3.  Tppl.eval  (k.  i ) 

while  (status  *. EVALUATE  and  tj-eody  ^  tjxext  )  { 
if  (tjone  >t„  )  status  =CONVERGEjCHECK 
else  if  ((k.  i )  is  in  dc  steady  state) 

determine  time  t  ^t -ready  such  that  the  inputs  to  (k.i )  are 
constant  on  [t_done.  t  ] 
if  (t  >  tjxext )  | 

t  j-tjione 
tjdone  —t 

tjxext  ♦~min{f  +S,  tb  1 
status  -TIME-LATENT 
check  successors 


} 


if  {status  = EVALUATE  )  { 

solve  subcircuit  at  time  t  {tjdone.  tjiext  ] 
tjdone  *-t 

pick  next  time  point  tjiext 
check  successors 

} 

) 

Algorithm  7  A.  Iter_lat  (k.  i ) 

determine  time  t  < thready  such  that  the  inputs  t o  (k.i)  match  the  previous 
iteration  on  [ta.  t  ] 
if  (t  >  tjdone  )  I 
tjlone  *-t 

tjiext  «-min{t  +8.  tb  } 
check  successors 

} 

if  ( tjdone  —tb  or  tjdone  Ktjeady  )  status  «- END_IT ER_LA  T ENT 
else  tjiext  «-min{r  +8,  tb } 

Algorithm  7.5.  Time_lat  (k.  i ) 

determine  time  t  < tjeady  such  that  the  inputs  to  (*.  i )  are  constant 
on  [tjdone.  t  ] 
if  (f  >tjdone  )  { 
tjlone  *-t 

tjiext  *-min{f  +8.  tb  } 
check  successors 

} 

if  ( tjdone  —tb  or  tjdone  <tjeady  )  status  *-END_TIME_LATENT 

Algorithm  73.  End_iter_lat  (k,  i ) 

v/*  \t  -1)(r  ).  for  all  t  €{fa,  tjdone } 
pick  next  time  point  tjiext 
status  EVALUATE 

Algorithm  7 A.  End_time_lat  ( k.i ) 

v,(<  \t_done  *(r ,) 

pick  next  time  point  tjiext 
status  EVALUATE 


1A  Results 


The  effect  of  iteration  and  time  latency  exploitation  on  a  uniprocessor  is  demon¬ 
strated  in  Table  7.3.  The  run  times  using  Gauss-Seidel  waveform  relaxation  with  and 
without  latency  are  listed  along  with  the  run  times  of  the  standard  direct  method.  The 
latency  results  were  obtained  using  time  latency  and  window-level  iteration  latency. 
Two  observations  are  apparent:  latency  has  a  significant  impact  on  the  run  time,  and 
waveform  relaxation  is  not  as  fast  as  the  direct  method  for  most  of  the  circuits  in  the 
benchmark  set.  The  exception  is  digji.  which  is  the  largest  and  most  uniformly  parti¬ 
tioned  of  the  circuits,  and  has  comparatively  weak  coupling  between  subcircuits.  It 
should  be  noted  that  the  benchmark  circuits  were  obtained  primarily  from  real  indus¬ 
trial  circuits,  and  they  were  not  prescreened  on  the  basis  of  suitability  for  waveform 
relaxation,  aside  from  the  fact  that  only  MOS  circuits  were  considered.  Also,  the 
automatic  partitioning  and  windowing  algorithms  were  used  without  manual  optimiza¬ 
tion.  Results  have  been  cited  in  the  literature  which  demonstrate  that  waveform  relax¬ 
ation  is  capable  of  speeds  greater  than  the  direct  method  on  a  uniprocessor  for  a  variety 
of  circuits  [Whi86].  In  the  context  of  this  chapter,  the  important  point  demonstrated 
by  Table  7.3  is  that  the  use  of  latency  is  important  in  making  the  uniprocessor  perfor¬ 
mance  of  waveform  relaxation  to  be  as  competitive  as  possible  with  alternative 


Table  7.3.  Uniprocessor  Run  Times  (in  seconds) 


Circuit 

Direct 

Method 

WR-GS 

WR-GS  with 
Latency 

dvs 

57 

127 

97 

126 


algorithms. 

As  previously  noted,  the  effect  of  time  latency  in  waveform  relaxation  is  small 
because  waveform  relaxation  is  intrinsically  a  multirate  integration  algorithm.  The 
impacts  of  iteration  and  time  latency  are  shown  separately  in  Table  7.4.  where  it  is  seen 
that  the  time  latency  effect  is  negligible  for  the  benchmark  circuits.  By  contrast,  itera¬ 
tion  latency  accounts  for  a  speed  improvement  close  to  2  on  a  uniprocessor. 

The  results  of  the  FWT  program  with  latency  enabled  are  given  in  Tables  7.5  and 
7.6.  The  reference  times  in  computing  the  speedups  are  the  Gauss-Seidel  uniprocessor 
run  times  with  latency  exploitation.  As  expected,  the  speedups  are  generally  less  than 
those  obtained  without  latency.  This,  of  course,  does  not  mean  that  the  run  times  are 
longer  when  latency  is  used,  it  only  means  that  the  contribution  of  parallel  processing  is 
reduced  somewhat. 

The  performance  of  the  TPPL  program  is  summarized  in  Tables  7.7  and  7.8.  The 
speedups  are  computed  with  respect  to  the  FWT  program  with  latency  running  on  1 
processor.  Since  TPPL  exploits  iteration  latency  over  partial  windows  and  FWT 
exploits  iteration  latency  only  over  entire  windows.  TPPL  exhibits  a  speed  improve¬ 
ment  over  FWT  even  on  1  processor  in  some  cases.  In  cases  marked  with  asterisks,  the 
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Table  7.4.  Uniprocessor  Speedups  Due  to  Latency:  Gauss-Seidel 


Circuit 

No 

Latency 

Iteration 

Latency 

Time 

Latency 

Both 

Latencies 

dvs 

1.39 

1.00 

1.37 

dpla 

1.74 

1.02 

1.81 

scdac 

1.84 

0.99 

1.82 

ben2k 

1.70 

0.99 

1.79 

_ 

2.51 

0.98 

2.58 

Table  7.5.  FWT  Gauss- Seidel  Speedups  with  Latency 


Circuit 

1 

Processors 

2  4 

8 

dvs 

1.0 

1.4 

1.3 

1.3 

dpla 

1.0 

1.7 

2.1 

1.9 

scdac 

1.0 

1.9 

3.3 

4.0 

ben2k 

1.0 

1.6 

1.8 

1.8 

digfi 

1.0 

1.9 

2.9 

3.1 

Table  7.6.  FWT  Gauss-Jacobi  Speedups  with  Latency.  Normalized  to  Gauss-Seidel 


Circuit 

1 

Processors 

2  4 

8 

dvs 

0.7 

1.3 

2.0 

2.6 

dpla 

0.7 

1.3 

2.4 

3.3 

scdac 

0.6 

1.2 

2.1 

3.2 

ben2k 

0.5 

1.0 

1.9 

2.7 

digfi 

0.6 

1.2 

2.1 

3.4 

Table  7.7.  TPPL  Gauss-Seidel  Speedups 


Circuit 

Processors 

1  8 

dvs 

dpla 

scdac 

ben2k 

digfi 

1.1  2.4 

1.1  3.5 

1.2  7.8 

0.6*  1.6* 

1.2  5.2 

Table  7.8.  TPPL  Gauss-Jacobi  Speedups.  Normalized  to  Gauss-Seidel 


Circuit 

Processors 

1  8 

dvs 

0.6 

2.9 

dpla 

0.6 

3.4 

scdac 

0.6 

4.1 

ben2k 

0.2* 

1.3* 

digfi 

0.6 

3.3* 

tolerances  had  to  be  reduced  below  the  default  values.  For  the  ben2k  circuit,  the  itera¬ 
tion  latency  tolerance,  local  truncation  error  tolerance,  and  Newton  convergence  toler- 
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ance  were  all  reduced  in  order  to  obtain  accurate  waveforms,  and  this  resulted  in  a 
significant  degradation  in  the  run  times.  For  the  digji  run  marked  with  an  asterisk,  the 
iteration  latency  tolerance  had  to  be  reduced  to  avoid  a  failure  of  the  integration  algo¬ 
rithm  at  an  effective  input  waveform  discontinuity.  In  the  cases  which  did  not  experi¬ 
ence  such  difficulties,  the  TPPL  performance  on  8  processors  surpassed  the  FWT  perfor¬ 
mance  on  8  processors. 

The  Gauss-Seidel  TPPL  speedups  compare  favorably  with  the  results  excluding 
latency  in  Table  6.1.  The  superior  speedups  are  due  to  the  use  of  partial  window  itera¬ 
tion  latency,  rather  than  to  an  increase  in  parallelism.  The  Gauss-Jacobi  speedups  with 
latency  are  significantly  less  than  those  obtained  without  latency.  In  the  Gauss-Jacobi 
case,  the  effect  demonstrated  in  Table  7.2  is  especially  applicable,  since  a  critical  path  in 
an  augmented  Gauss-Jacobi  task  graph  normally  consists  of  instances  of  the  largest  sub¬ 
circuit.  which  is  unlikely  to  be  latent  over  the  entire  window.  A  critical  path  in  a 
Gauss-Seidel  task  graph  normally  contains  a  mixture  of  different  subcircuits,  some  of 
which  may  be  latent  and  others  which  are  not.  Consequently,  when  Gauss-Jacobi  time 
point  pipelining  is  used  for  a  circuit  which  is  dominated  by  a  large  subcircuit,  latency 
does  not  have  a  significant  effect  on  the  parallel  execution  time  if  the  number  of  proces¬ 
sors  is  large  compared  to  the  available  parallelism.  However,  the  parallel  run  time  is 
reduced  in  those  cases  where  the  number  of  processors  is  small  compared  to  the  avail¬ 
able  parallelism,  due  to  the  reduction  in  the  number  of  computations.  This  can  be 
observed  more  clearly  in  Table  7.9  where  the  run  times  of  Gauss-Jacobi  time  point  pipe¬ 
lining  on  8  processors  show  little  improvement  due  to  latency  for  the  small  circuits  dvs 
and  dpla.  but  show  a  significant  improvement  for  digji,  where  the  subcircuit  partition¬ 
ing  is  more  uniform  and  where  the  number  of  processors  is  small  compared  to  the  avail¬ 
able  parallelism. 
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Table  7.9.  Transient  Analysis  Run  Times 


Method 

Procs. 

dvs 

dpi* 

Circuit 

scdac 

ben2k 

§su 

Direct 

1 

57 

54 

420 

297 

1342 

FWT-GS 

8 

91 

55 

209 

278 

501 

FWT-GJ 

8 

40 

27 

218 

150 

403 

TPP-GS 

8 

54 

39 

173 

218 

419 

TPP-GJ 

8 

32 

24 

218 

160 

433 

FWT-L-GS 

8 

70 

33 

121 

168 

216 

FWT-L-GJ 

8 

34 

25 

162 

105 

201 

TPP-L-GS 

8 

40 

19 

73 

210* 

TPP-L-GJ 

8 

31 

19 

138 

257* 

Key:  L-latency  exploited:  '-tighter  tolerances  used  | 

Finally.  Table  7.9  shows  the  run  times  of  the  different  parallel  waveform  relaxa¬ 
tion  methods  on  8  processors,  and  the  run  times  of  the  direct  method  on  1  processor. 
Parallel  waveform  relaxation  on  8  processors  is  faster  than  the  direct  method  on  1  pro¬ 
cessor.  even  if  latency  is  not  exploited  and  even  though  these  circuits  do  not  have  good 
waveform  relaxation  performance  on  1  processor.  Further  performance  improvements 
are  attainable  with  latency  exploitation  on  8  processors,  but  this  performance  advantage 
comes  at  the  expense  of  reduced  reliability. 
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CHAPTER  8 

CONCLUSIONS 


Four  different  parallel  forms  of  the  basic  waveform  relaxation  algorithm  with 
windowing  have  been  studied.  These  parallel  algorithms  use  Gauss-Seidel  or  Gauss- 
Jacobi  relaxation  in  combination  with  either  the  full  window  technique  or  the  time 
point  pipelining  technique  for  coordinating  the  parallel  execution  of  computations.  The 
superiority  of  Gauss-Seidel  over  Gauss- Jacobi  for  the  solution  of  MOS  digital  circuits 
on  a  uniprocessor  is  well  known.  The  use  of  Gauss-Jacobi  relaxation  for  circuit  simula¬ 
tion  has  been  largely  avoided  in  previous  work,  because  more  iterations  are  required  for 
convergence.  However,  when  sufficiently  many  processors  are  available,  the  overall  run 
time  of  Gauss-Jacobi  is  less  than  that  of  Gauss-Seidel.  This  relationship  between 
Gauss-Seidel  and  Gauss-Jacobi  has  been  established  formally  for  the  linear  algebraic 
equation  case  in  a  theorem  which  relates  the  spectral  radii  of  the  Gauss-Seidel  and 
Gauss-Jacobi  iteration  matrices  to  the  relative  degrees  of  available  parallelism.  Results 
of  the  PARASITE  parallel  simulation  time  estimator,  which  produces  accurate  estimates 
of  the  parallel  run  times  neglecting  overhead,  shows  that  the  extra  parallelism  of 
Gauss-Jacobi  waveform  relaxation  is  more  than  sufficient  to  result  in  faster  run  times 
than  Gauss-Seidel.  when  the  number  of  processors  is  sufficiently  large.  These  results 
are  confirmed  by  actual  multiprocessor  circuit  simulations,  using  the  FWT  and  TPP 
parallel  waveform  relaxation  programs  running  on  an  Alliant  FX/S.  For  4  of  the  S 
benchmark  circuits,  using  the  full  window  technique,  the  performance  of  the  Gauss- 
Jacobi  method  surpasses  that  of  Gauss-Seidel  on  8  processors.  For  the  other  circuit. 
PARASITE  indicates  that  Gauss-Jacobi  will  be  faster  than  Gauss-Seidel  on  16  or  more 


processors. 

Time  point  pipelining  exposes  more  parallelism  than  the  full  window  technique, 
and  it  introduces  additional  overhead.  The  results  of  the  FWT  and  TPP  programs 
confirm  that  time  point  pipelining  produces  faster  run  times  than  the  full  window  tech¬ 
nique  in  those  cases  where  the  full  window  technique  does  not  expose  enough  parallel¬ 
ism  to  keep  the  processors  busy.  The  previously  unexplored  combination  of  tune  point 
pipelining  and  the  Gauss-Jacobi  method  has  been  shown  to  be  the  fastest  of  the  four 
bask  parallel  waveform  relaxation  algorithms  when  the  number  of  processors  is  large 
enough.  PARASITE  estimates  indicate  that  speedups  of  about  one  order  of  magnitude 
should  be  possible  on  about  32  processors  for  1000-transistor  circuits,  where  the 
speedup  is  computed  with  respect  to  the  Gauss-Seidel  method  on  a  single  processor. 

The  available  parallelism  of  the  Gauss-Seidel  and  Gauss-Jacobi  methods  can  be 
estimated  prior  to  simulating  a  circuit,  by  performing  a  computationally  inexpensive 
analysis  of  the  task  graphs.  A  presimulation  selection  procedure  has  been  presented 
which  uses  these  estimates  to  choose  either  the  Gauss-Seidel  or  Gauss-Jacobi  method  for 
simulating  a  given  circuit  on  a  given  number  of  processors.  Furthermore,  by  using  the 
parallelism  estimates  to  predict  the  potential  processor  utilization,  the  selection  pro¬ 
cedure  chooses  either  the  full  window  technique  or  time  point  pipelining. 

The  uniprocessor  speed  of  the  bask  waveform  relaxation  algorithm  is  improved 
significantly  when  iteration  latency  (or  partial  waveform  convergence)  is  exploited.  On 
parallel  processors,  iteration  latency  exploitation  reduces  the  amount  of  parallelism  in 
the  benchmark  circuit  examples,  but  the  overall  performance  is  still  better  when 
latency  is  exploited.  Since  latency  exploitation  reduces  the  parallelism  to  a  greater 
extent  for  the  algorithms  with  greater  parallelism,  the  differences  in  run  times  of  the 
different  algorithms  are  reduced  when  latency  is  exploited.  The  choke  of  the  latency 


tolerances  involves  a  tradeoff  between  greater  latency  exploitation  and  reliability,  since 
loose  tolerances  can  lead  to  inaccurate  waveforms  or  failures  of  the  integration  algo¬ 
rithm.  The  reliability  problems  are  most  severe  when  time  point  level  iteration  latency 
is  used. 

Waveform  relaxation  on  a  uniprocessor  works  best  for  circuits  which  can  be  parti¬ 
tioned  into  a  large  number  of  weakly  coupled  subcircuits  of  nearly  the  same  size.  These 
are  the  same  circuit  properties  which  lead  to  good  speedups  in  parallel  waveform  relax¬ 
ation.  Even  though  some  of  the  benchmark  circuits  used  in  obtaining  the  experimental 
results  do  not  satisfy  this  criterion,  the  parallel  waveform  relaxation  run  times  on  8 
processors  are  significantly  less  than  the  uniprocessor  run  times  of  both  waveform 
relaxation  and  the  standard  direct  methods.  Further  speed  improvements  ere  possible 
from  the  natural  parallelism  of  waveform  relaxation  if  more  processors  are  used. 
When  large  subcircuits  arise  due  to  large  subsets  of  tightly  coupled  nodes,  the  seme 
techniques  used  to  parallelize  the  standard  direct  methods  can  be  used  within  the  Urge 
subcircuits  to  expose  levels  of  parallelism  which  go  beyond  the  natural  reUxation 
parallelism  studied  in  this  thesis. 
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